{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Feature Engineering Notebook Two: Feature Selection**  \n",
    "*Author: Yingxiang Chen, Zihan Yang*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Reference**\n",
    "- https://scikit-learn.org/stable/modules/feature_selection.html\n",
    "- https://machinelearningmastery.com/an-introduction-to-feature-selection/\n",
    "- https://dcor.readthedocs.io/en/latest/functions/dcor.independence.distance_covariance_test.html\n",
    "- https://en.wikipedia.org/wiki/Distance_correlation\n",
    "- https://stats.stackexchange.com/questions/56881/whats-the-relationship-between-r2-and-f-test\n",
    "- https://en.wikipedia.org/wiki/Mutual_information\n",
    "- https://libguides.library.kent.edu/SPSS/ChiSquare\n",
    "- https://chrisalbon.com/machine_learning/feature_selection/anova_f-value_for_feature_selection/\n",
    "- https://online.stat.psu.edu/stat414/node/218/\n",
    "- http://featureselection.asu.edu/algorithms.php\n",
    "- https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection\n",
    "- Peng, H., Long, F., & Ding, C. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence, (8), 1226-1238.\n",
    "- Hall, M. A., & Smith, L. A. (1999, May). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In FLAIRS conference (Vol. 1999, pp. 235-239).\n",
    "- Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML-03) (pp. 856-863).\n",
    "- Zhao, Z., & Liu, H. (2007, June). Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th international conference on Machine learning (pp. 1151-1157). ACM.\n",
    "- Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine learning, 53(1-2), 23-69.\n",
    "- https://machinelearningmastery.com/an-introduction-to-feature-selection/\n",
    "- http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/\n",
    "- https://cloud.tencent.com/developer/article/1087796 [Chinese]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Feature-Selection\" data-toc-modified-id=\"Feature-Selection-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Feature Selection</a></span><ul class=\"toc-item\"><li><span><a href=\"#Filter-Methods\" data-toc-modified-id=\"Filter-Methods-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Filter Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#Univariate-Filter-Methods\" data-toc-modified-id=\"Univariate-Filter-Methods-1.1.1\"><span class=\"toc-item-num\">1.1.1&nbsp;&nbsp;</span>Univariate Filter Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#Variance-Threshold\" data-toc-modified-id=\"Variance-Threshold-1.1.1.1\"><span class=\"toc-item-num\">1.1.1.1&nbsp;&nbsp;</span>Variance Threshold</a></span></li><li><span><a href=\"#Pearson-Correlation-(regression-problem)\" data-toc-modified-id=\"Pearson-Correlation-(regression-problem)-1.1.1.2\"><span class=\"toc-item-num\">1.1.1.2&nbsp;&nbsp;</span>Pearson Correlation (regression problem)</a></span></li><li><span><a href=\"#Distance-Correlation-(regression-problem)\" data-toc-modified-id=\"Distance-Correlation-(regression-problem)-1.1.1.3\"><span class=\"toc-item-num\">1.1.1.3&nbsp;&nbsp;</span>Distance Correlation (regression problem)</a></span></li><li><span><a href=\"#F-Score-(regression-problem)\" data-toc-modified-id=\"F-Score-(regression-problem)-1.1.1.4\"><span class=\"toc-item-num\">1.1.1.4&nbsp;&nbsp;</span>F-Score (regression problem)</a></span></li><li><span><a href=\"#Mutual-Information-(regression-problem)\" data-toc-modified-id=\"Mutual-Information-(regression-problem)-1.1.1.5\"><span class=\"toc-item-num\">1.1.1.5&nbsp;&nbsp;</span>Mutual Information (regression problem)</a></span></li><li><span><a href=\"#Chi-squared-Statistics-(classification-problem)\" data-toc-modified-id=\"Chi-squared-Statistics-(classification-problem)-1.1.1.6\"><span class=\"toc-item-num\">1.1.1.6&nbsp;&nbsp;</span>Chi-squared Statistics (classification problem)</a></span></li><li><span><a href=\"#F-Score-(classification-problem)\" data-toc-modified-id=\"F-Score-(classification-problem)-1.1.1.7\"><span class=\"toc-item-num\">1.1.1.7&nbsp;&nbsp;</span>F-Score (classification problem)</a></span></li><li><span><a href=\"#Mutual-Information-(classification-problem)\" data-toc-modified-id=\"Mutual-Information-(classification-problem)-1.1.1.8\"><span class=\"toc-item-num\">1.1.1.8&nbsp;&nbsp;</span>Mutual Information (classification problem)</a></span></li></ul></li><li><span><a href=\"#Multivariate-Filter-Methods\" data-toc-modified-id=\"Multivariate-Filter-Methods-1.1.2\"><span class=\"toc-item-num\">1.1.2&nbsp;&nbsp;</span>Multivariate Filter Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#Max-Relevance-Min-Redundancy-(mRMR)\" data-toc-modified-id=\"Max-Relevance-Min-Redundancy-(mRMR)-1.1.2.1\"><span class=\"toc-item-num\">1.1.2.1&nbsp;&nbsp;</span>Max-Relevance Min-Redundancy (mRMR)</a></span></li><li><span><a href=\"#Correlation-based-Feature-Selection-(CFS)\" data-toc-modified-id=\"Correlation-based-Feature-Selection-(CFS)-1.1.2.2\"><span class=\"toc-item-num\">1.1.2.2&nbsp;&nbsp;</span>Correlation-based Feature Selection (CFS)</a></span></li><li><span><a href=\"#Fast-Correlation-based-Filter-(FCBF)\" data-toc-modified-id=\"Fast-Correlation-based-Filter-(FCBF)-1.1.2.3\"><span class=\"toc-item-num\">1.1.2.3&nbsp;&nbsp;</span>Fast Correlation-based Filter (FCBF)</a></span></li><li><span><a href=\"#ReliefF\" data-toc-modified-id=\"ReliefF-1.1.2.4\"><span class=\"toc-item-num\">1.1.2.4&nbsp;&nbsp;</span>ReliefF</a></span></li><li><span><a href=\"#Spectral-Feature-Selection-(SPEC)\" data-toc-modified-id=\"Spectral-Feature-Selection-(SPEC)-1.1.2.5\"><span class=\"toc-item-num\">1.1.2.5&nbsp;&nbsp;</span>Spectral Feature Selection (SPEC)</a></span></li></ul></li></ul></li><li><span><a href=\"#Wrapper-Methods\" data-toc-modified-id=\"Wrapper-Methods-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Wrapper Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#Deterministic-Algorithms\" data-toc-modified-id=\"Deterministic-Algorithms-1.2.1\"><span class=\"toc-item-num\">1.2.1&nbsp;&nbsp;</span>Deterministic Algorithms</a></span><ul class=\"toc-item\"><li><span><a href=\"#Recursive-Feature-Elimination-(SBS)\" data-toc-modified-id=\"Recursive-Feature-Elimination-(SBS)-1.2.1.1\"><span class=\"toc-item-num\">1.2.1.1&nbsp;&nbsp;</span>Recursive Feature Elimination (SBS)</a></span></li></ul></li><li><span><a href=\"#Randomized-Algorithms\" data-toc-modified-id=\"Randomized-Algorithms-1.2.2\"><span class=\"toc-item-num\">1.2.2&nbsp;&nbsp;</span>Randomized Algorithms</a></span><ul class=\"toc-item\"><li><span><a href=\"#Simulated-Annealing-(SA)\" data-toc-modified-id=\"Simulated-Annealing-(SA)-1.2.2.1\"><span class=\"toc-item-num\">1.2.2.1&nbsp;&nbsp;</span>Simulated Annealing (SA)</a></span></li><li><span><a href=\"#Genetic-Algorithm-(GA)\" data-toc-modified-id=\"Genetic-Algorithm-(GA)-1.2.2.2\"><span class=\"toc-item-num\">1.2.2.2&nbsp;&nbsp;</span>Genetic Algorithm (GA)</a></span></li></ul></li></ul></li><li><span><a href=\"#Embedded-Methods\" data-toc-modified-id=\"Embedded-Methods-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Embedded Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#Regulization-Based-Methods\" data-toc-modified-id=\"Regulization-Based-Methods-1.3.1\"><span class=\"toc-item-num\">1.3.1&nbsp;&nbsp;</span>Regulization Based Methods</a></span><ul class=\"toc-item\"><li><span><a href=\"#Lasso-Regression-(Linear-Regression-with-L1-Norm)\" data-toc-modified-id=\"Lasso-Regression-(Linear-Regression-with-L1-Norm)-1.3.1.1\"><span class=\"toc-item-num\">1.3.1.1&nbsp;&nbsp;</span>Lasso Regression (Linear Regression with L1 Norm)</a></span></li><li><span><a href=\"#Logistic-Regression-(with-L1-Norm)\" data-toc-modified-id=\"Logistic-Regression-(with-L1-Norm)-1.3.1.2\"><span class=\"toc-item-num\">1.3.1.2&nbsp;&nbsp;</span>Logistic Regression (with L1 Norm)</a></span></li><li><span><a href=\"#LinearSVR/-LinearSVC\" data-toc-modified-id=\"LinearSVR/-LinearSVC-1.3.1.3\"><span class=\"toc-item-num\">1.3.1.3&nbsp;&nbsp;</span>LinearSVR/ LinearSVC</a></span></li></ul></li><li><span><a href=\"#Tree-Based-Methods\" data-toc-modified-id=\"Tree-Based-Methods-1.3.2\"><span class=\"toc-item-num\">1.3.2&nbsp;&nbsp;</span>Tree Based Methods</a></span></li></ul></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After data preprocessing, we have generated a large number of new variables (recall by one-hot encoding, a large number of variables are generated and contain only 0 or 1). But in fact, some of the newly generated variables may be redundant. On the one hand, they might not contain any useful information and therefore cannot improve model performance. On the other hand, these extra variables will consume a lot of memory and computing power when building the model. Therefore, we should perform feature selection and select a subset of features for modeling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filter method uses some statistics measures or hypothesis test results to assign scores to each feature. Features with higher scores tend to be more important and should be included in the subsets. Below is a sample ML pipeline based on train-validation-testset split method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-01-31T23:03:05.919797Z",
     "start_time": "2020-01-31T23:03:05.805176Z"
    }
   },
   "source": [
    "![image](./images/Filter_Pipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Univariate Filter Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Univariate filter methods select the best feature based on univariate statistical tests. It considers the feature independently or with regard to the target variable only."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Variance Threshold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variance approach simply removes all the features that have variance below a certain threshold. For example, a feature with zero variance (features that have the same value in all observations) should be removed because this feature can not examine any variance in the target variance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:26:45.332184Z",
     "start_time": "2020-03-30T03:26:43.478770Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.feature_selection import VarianceThreshold\n",
    "\n",
    "# create some synthesized dataset\n",
    "train_set = np.array([[1,2,3],[1,4,7],[1,4,9]]) # the first feature have zero variance\n",
    "# array([[1, 2, 3],\n",
    "#        [1, 4, 7],\n",
    "#        [1, 4, 9]])\n",
    "\n",
    "test_set = np.array([[3,2,3],[1,2,7]]) # the second feature have zero variance\n",
    "# array([[3, 2, 3],\n",
    "#        [1, 2, 7]])\n",
    "\n",
    "selector = VarianceThreshold()\n",
    "selector.fit(train_set) # fit on trainset\n",
    "transformed_train = selector.transform(train_set) # transform train set\n",
    "# the first feature has been removed\n",
    "# array([[2, 3],\n",
    "#        [4, 7],\n",
    "#        [4, 9]])\n",
    "\n",
    "transformed_test = selector.transform(test_set) # transform test set\n",
    "# array([[2, 3],\n",
    "#        [2, 7]])\n",
    "# although in the test set the second features has zero variance\n",
    "# but according to train set, we should remove the first feature only"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-01-13T17:20:24.634631Z",
     "start_time": "2020-01-13T17:20:24.629774Z"
    }
   },
   "source": [
    "#### Pearson Correlation (regression problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pearson Correlation is generally used to measure the linear correlection between two **continuous** features. It can also measure the linear correlation between binary features and the target variable. And the categorical variable can be converted into binary features by one-hot encoding.   \n",
    "\n",
    "Formula:  \n",
    "  \n",
    "$r = \\frac{\\sum_{i=1}^{n}(X_i-\\bar{X})(Y_i-\\bar{Y})}{\\sqrt{(X_i-\\bar{X})^2}\\sqrt{(Y_i-\\bar{Y})^2}}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:26:45.410045Z",
     "start_time": "2020-03-30T03:26:45.334519Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import pearsonr\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# load dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "# in this dataset, both X and y are continuous, so we can use Pearson Correlation to select features\n",
    "\n",
    "# use the first 15000 obs as train_set\n",
    "# the rest obs as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# there is no pre-built Pearson function in sklearn\n",
    "# we need to use scipy.stats.pearsonr (can only compute the pearsonr between two features) \n",
    "# to create a multivariate version of Pearson function that can be used in SelectKBest\n",
    "\n",
    "def udf_pearsonr(X, y):\n",
    "    result = np.array([pearsonr(x, y) for x in X.T]) # list of (pearsonr, p-value)\n",
    "    return np.absolute(result[:,0]), result[:,1] \n",
    "\n",
    "selector = SelectKBest(udf_pearsonr, k=2) # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on trainset\n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_train.shape #(15000, 2), select the 1st and 7th features\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,6]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set)\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,6]]);\n",
    "# the 1st and 7th features are selected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:26:45.419519Z",
     "start_time": "2020-03-30T03:26:45.412586Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The absolute value of the correlation between the 1 feature and target is 0.7,p-value is 0.0\n",
      "The absolute value of the correlation between the 2 feature and target is 0.07,p-value is 0.0\n",
      "The absolute value of the correlation between the 3 feature and target is 0.14,p-value is 0.0\n",
      "The absolute value of the correlation between the 4 feature and target is 0.04,p-value is 0.0\n",
      "The absolute value of the correlation between the 5 feature and target is 0.02,p-value is 0.011\n",
      "The absolute value of the correlation between the 6 feature and target is 0.05,p-value is 0.0\n",
      "The absolute value of the correlation between the 7 feature and target is 0.23,p-value is 0.0\n",
      "The absolute value of the correlation between the 8 feature and target is 0.08,p-value is 0.0\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    pea_score, p_value = pearsonr(train_set[:,idx], train_y)\n",
    "    print(f\"The absolute value of the correlation between the {idx + 1} feature and target is {round(np.abs(pea_score),2)},p-value is {round(p_value,3)}\")\n",
    "# so we should select the 1st and 7th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Distance Correlation (regression problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Distance correlation measures the dependence between two continuous features. Unlike the Pearson correlation, distance correlation measures both linear and nonlinear relationships between two variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "Firstly, compute the (n x n) distance matrices dX with $dX_{ij}$ as elements and dY with $dY_{ij}$ as elementns containing all pairwise distances. $dX_{ij}$ is the distance between observation i and observation j:    \n",
    "  \n",
    "$dX_{ij} = \\left \\| X_i - X_j \\right \\|$  \n",
    "$dY_{ij} = \\left \\| Y_i - Y_j \\right \\|$\n",
    "  \n",
    "Secondly, we calculate the doubly centered distances as below and update the distance matrices. $\\bar{X_i}$ is the i-th row mean of distance matrices dX, $\\bar{X_j}$ the j-th column mean of distance matrices dX, $\\sum_i^N \\sum_j^N dX_{ij}$ is the grand mean.\n",
    "  \n",
    "$dX_{ij} = dX_{ij} - \\bar{X_i} - \\bar{X_j} + \\frac{1}{N^2} \\sum_i^N \\sum_j^N dX_{ij}$  \n",
    "$dY_{ij} = dY_{ij} - \\bar{Y_i} - \\bar{Y_j} + \\frac{1}{N^2} \\sum_i^N \\sum_j^N dY_{ij}$   \n",
    "  \n",
    "Then, we compute the sample distance covariance/ variance as below:  \n",
    "  \n",
    "$dCov^2 (X, Y) = \\frac{1}{N^2} \\sum_{i}^{N} \\sum_{j}^{N} dX_{ij}dY_{ij}$  \n",
    "$dVar^2(X) = Cov^2_D (X, X)$  \n",
    "\n",
    "Finally, the distance correlation $dCor(X,Y)$ is as below:  \n",
    "  \n",
    "$dCor(X,Y) = \\frac{dCov_D(X, Y)}{\\sqrt{dVar^2(X)}\\sqrt{dVar^2(Y)} }$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:35:52.279299Z",
     "start_time": "2020-03-30T03:26:45.421784Z"
    }
   },
   "outputs": [],
   "source": [
    "from dcor import distance_correlation\n",
    "from dcor.independence import distance_covariance_test\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# load dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "# in this dataset, both X and y are continuous, so we can use distance Correlation to select features\n",
    "\n",
    "# use the first 15000 obs as train_set\n",
    "# the rest obs as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# there is no pre-built Distance Correlation function in sklearn\n",
    "# we need to use dcor.distance_correlation (can only compute the distance_correlation between two features) \n",
    "# to create a multivariate version of distance_correlation function that can be used in SelectKBest\n",
    "\n",
    "def udf_dcorr(X, y):\n",
    "    result = np.array([[distance_correlation(x, y), \n",
    "                        distance_covariance_test(x,y)[0]]for x in X.T]) # list of (pearsonr, p-value)\n",
    "    return result[:,0], result[:,1]\n",
    "\n",
    "selector = SelectKBest(udf_dcorr, k=2) # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on trainset\n",
    "transformed_train = selector.transform(train_set) # transform trainset \n",
    "transformed_train.shape #(15000, 2), select the 1st and 3rd features\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,2]])\n",
    "\n",
    "transformed_test = selector.transform(test_set)\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,2]]);\n",
    "# the 1st and 3rd features are selected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:51.758666Z",
     "start_time": "2020-03-30T03:35:52.333149Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The d-correlation between the 1 feature and target is 0.66, p-value is 1.0\n",
      "The d-correlation between the 2 feature and target is 0.07, p-value is 1.0\n",
      "The d-correlation between the 3 feature and target is 0.31, p-value is 1.0\n",
      "The d-correlation between the 4 feature and target is 0.12, p-value is 1.0\n",
      "The d-correlation between the 5 feature and target is 0.08, p-value is 1.0\n",
      "The d-correlation between the 6 feature and target is 0.29, p-value is 1.0\n",
      "The d-correlation between the 7 feature and target is 0.25, p-value is 1.0\n",
      "The d-correlation between the 8 feature and target is 0.19, p-value is 1.0\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    d_score = distance_correlation(train_set[:,idx], train_y)\n",
    "    p_value = distance_covariance_test(train_set[:,idx], train_y)[0]\n",
    "    print(f\"The d-correlation between the {idx + 1} feature and target is {round(d_score,2)}, p-value is {round(p_value,3)}\")\n",
    "# so we should select the 1st and 3rd features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### F-Score (regression problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The F-Score reports whether any of the independent variables in a linear regression model are significant. Specifically, suppose we have p features, we construct p univariate linear regression for each feature separately, each regress the target variable with the i-th feature and a constant only. Then we can report the F-Score of each linear model which captures the linear relationship between the ith feature and the target variable. The null hypothesis for F-Score is that the feature is not related to the target variable. So we should select features that have higher F-Score (more likely to reject the null hypothesis)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "\n",
    "$F = \\frac{(SST - SSR)/(p - 1)}{SSR/(n - p)} =  \\frac{SST - SSR}{SSR/(n - 2)} =  \\frac{R^2}{(1 - R^2)(n - 2)} = \\frac{\\rho ^2}{(1 - \\rho ^2)(n - 2)}$  \n",
    "  \n",
    "where:  \n",
    "\n",
    "$SST = \\sum_{i=1}^{n}(y_i - \\overline{y}) ^2$  \n",
    "  \n",
    "$\\overline{y} = \\frac{1}{n} \\sum_{i=1}^{n}y_i$  \n",
    "  \n",
    "$SSR = \\sum_{i=1}^{n}(\\widehat{y}_i - \\overline{y})^2$  \n",
    "  \n",
    "$\\widehat{y}_i$ is the predicted value by the model\n",
    "  \n",
    "SST is the total sum of squares, SSR is the residual sum of squares, p is the number of predictors (including the constant, so p = 2 in our case), $\\rho$ is the correlation coefficient between feature i and the target variable and n is the number of observations. Since in the linear model, there is only one non-constant variable, so $\\rho^2 = R^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:51.946039Z",
     "start_time": "2020-03-30T03:43:51.776859Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import f_regression\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# load dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "# in this dataset, both X and y are continuous, so we can use F-score to select features\n",
    "\n",
    "# use the first 15000 obs as train_set\n",
    "# the rest obs as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# use pre-built f-score function in sklearn\n",
    "\n",
    "selector = SelectKBest(f_regression, k=2) # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on trainset \n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_train.shape #(15000, 2), select the 1st and 7th features\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,6]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set)\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,6]]);\n",
    "# the 1st and 7th features are selected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:51.971983Z",
     "start_time": "2020-03-30T03:43:51.949208Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The F-Score between the 1 feature and target is 14111.79, p-value is 0.0\n",
      "The F-Score between the 2 feature and target is 71.99, p-value is 0.0\n",
      "The F-Score between the 3 feature and target is 317.04, p-value is 0.0\n",
      "The F-Score between the 4 feature and target is 23.93, p-value is 0.0\n",
      "The F-Score between the 5 feature and target is 6.54, p-value is 0.011\n",
      "The F-Score between the 6 feature and target is 35.93, p-value is 0.0\n",
      "The F-Score between the 7 feature and target is 846.61, p-value is 0.0\n",
      "The F-Score between the 8 feature and target is 98.06, p-value is 0.0\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score, p_value = f_regression(train_set[:,idx].reshape(-1,1), train_y)\n",
    "    print(f\"The F-Score between the {idx + 1} feature and target is {round(score[0],2)}, p-value is {round(p_value[0],3)}\")\n",
    "# so we should select the 1st and 7th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mutual Information (regression problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mutual information measures the dependency between the two variables, that is, the reduction in entropy after knowing the information of another variable.   \n",
    "\n",
    "MI is equal to zero if and only if two random variables are independent, and higher values reflect higher dependency. Compared with Pearson correlation & F-Score, it also captures non-linear relationships."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "- For discrete distributions (for both x and y):  \n",
    "  \n",
    "    $I(x, y) = H(Y) - H(Y|X) = \\sum_{x\\in \\mathit{X}}  \\sum_{x\\in \\mathit{Y}} \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)})$  \n",
    "\n",
    "    Where $\\textit{p}_{(X,Y)}(x,y)$ is the joint probability mass function (PMF) of x and y, $\\textit{p}_{X}(x)$ is the PMF of x.  \n",
    "  \n",
    "- For continuous distributions (for both x and y):  \n",
    "\n",
    "    $I(X, Y) = H(Y) - H(Y|X) = \\int_X \\int_Y  \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)}) \\, \\, dx dy$  \n",
    "    \n",
    "    Where $\\textit{p}_{(X,Y)}(x,y)$ is the joint probability density function (PDF) of x and y, $\\textit{p}_{X}(x)$ is the PDF of x.  \n",
    "  \n",
    "  \n",
    "But in reality, is it likely that one of x and y is discrete variable and the another is continuous variable. So in sklearn, it implement the nonparametric methods based on entropy estimation from k-nearest neighbors distances proposed in [1] and [2].  \n",
    "  \n",
    "  \n",
    "[1] A. Kraskov, H. Stogbauer and P. Grassberger, “Estimating mutual information”. Phys. Rev. E 69, 2004.  \n",
    "[2] B. C. Ross “Mutual Information between Discrete and Continuous Data Sets”. PLoS ONE 9(2), 2014. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:54.228223Z",
     "start_time": "2020-03-30T03:43:51.980915Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import mutual_info_regression\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# load dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "# in this dataset, both X and y are continuous, so we can use MI to select features\n",
    "\n",
    "# use the first 15000 obs as train_set\n",
    "# the rest obs as test_set\n",
    "train_set = X[0:15000,:].astype(float)\n",
    "test_set = X[15000:,].astype(float)\n",
    "train_y = y[0:15000].astype(float)\n",
    "\n",
    "# since n_neighbors in the KNN is also a very important parameters\n",
    "# so we write a new MI function function based on pre-built MI function in sklearn\n",
    "# to allow more flexibility\n",
    "\n",
    "def udf_MI(X, y):\n",
    "    result = mutual_info_regression(X, y, n_neighbors = 5) # user_defined n_neighbors\n",
    "    return result\n",
    "\n",
    "selector = SelectKBest(udf_MI, k=2) # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on trainset\n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_train.shape #(15000, 2), select the 1st and 8th features\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,7]]) # return True\n",
    "\n",
    "transformed_test = selector.transform(test_set) # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,7]]);\n",
    "# the 1st and 8th features are selected"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.082315Z",
     "start_time": "2020-03-30T03:43:54.230729Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The MI between the 1 feature and target is 0.37\n",
      "The MI between the 2 feature and target is 0.03\n",
      "The MI between the 3 feature and target is 0.1\n",
      "The MI between the 4 feature and target is 0.03\n",
      "The MI between the 5 feature and target is 0.02\n",
      "The MI between the 6 feature and target is 0.09\n",
      "The MI between the 7 feature and target is 0.37\n",
      "The MI between the 8 feature and target is 0.46\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score = mutual_info_regression(train_set[:,idx].reshape(-1,1), train_y, n_neighbors = 5)\n",
    "    print(f\"The MI between the {idx + 1} feature and target is {round(score[0],2)}\")\n",
    "# so we should select the 1st and 8th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Chi-squared Statistics (classification problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Chi-Square Statistic determines whether there is a relationship between categorical variables. Sklearn provides chi2 function to calculate Chi-square. The input of this function should be booleans or frequencies. The null hypothesis is that two variables are independent, so the higher the chi-square, the higher the probability that the two variables are correlated, so we should select features that have higher chi-square scores."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "$\\chi^2 = \\sum_{i=1}^{r} \\sum_{j=1}^{c} \\frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = n \\sum_{i,j} p_ip_j(\\frac{\\frac{O_{i,j}}{n} - p_i p_j}{p_i p_j})^2$  \n",
    "  \n",
    "$O_{i,j}$ is the number of observations that have the i-th category value in feature X and\n",
    "the j-th category value in feature Y. $E_{i,j}$ is the expected number of observations that have the i-th category value in feature X and\n",
    "the j-th category value in feature Y. n is the number of observations in the dataset. $p_i$ is the probability of having the i-th category value in feature X, $p_j $ is the probability of having the j-th category value in feature Y.  \n",
    " \n",
    "**We did some research on the original source code of sklearn. And we figure out that in fact, the chi-square statistic calculated using chi2 function in sklearn is not the real Chi-square statistic**. When the input feature is a boolean feature, the value calculated by chi2 function only considers the situation when the feature value is True (we will elaborate this later). This has the advantage that the sum of the output calculated by the chi2 equation for all the boolean variables generated by one-hot encoding will equal to the chi-square statistics of the original variable.  \n",
    "  \n",
    "For example, suppose a variable I has three possible categorical values: 0, 1, and 2. After the one-hot encoding, three new boolean variables will be generated. The sum of the chi2 function output of these three boolean variables equals the chi-square statistics calculated directly between the original variable I and the dependent variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Explain how sklearn calculate chi2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.149027Z",
     "start_time": "2020-03-30T03:43:56.083963Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Output</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>J</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>J</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>J</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>C</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>C</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>C</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>C</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Type  Output\n",
       "0     J       0\n",
       "1     J       1\n",
       "2     J       0\n",
       "3     B       2\n",
       "4     B       0\n",
       "5     B       1\n",
       "6     C       0\n",
       "7     C       0\n",
       "8     C       1\n",
       "9     C       2\n",
       "10    C       2"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# generate a dataset\n",
    "import pandas as pd\n",
    "sample_dict = {'Type': ['J','J','J',\n",
    "                        'B','B','B',\n",
    "                        'C','C','C','C','C'], \n",
    "               'Output': [0, 1, 0, \n",
    "                          2, 0, 1,  \n",
    "                          0, 0, 1, 2, 2,]}\n",
    "sample_raw = pd.DataFrame(sample_dict)\n",
    "sample_raw # raw data, `output` is our target variable, `Type` is the input features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.170168Z",
     "start_time": "2020-03-30T03:43:56.150584Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([0.17777778, 0.42666667, 1.15555556]),\n",
       " array([0.91494723, 0.8078868 , 0.56114397]))"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# one-hot encoding the dataset and generate boolean variable\n",
    "# use sklearn chi2 to calculate chi2 value for each boolean variables\n",
    "\n",
    "sample = pd.get_dummies(sample_raw)\n",
    "from sklearn.feature_selection import chi2\n",
    "chi2(sample.values[:,[1,2,3]],sample.values[:,[0]])\n",
    "# the first row is the chi2 value for each boolean variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.185913Z",
     "start_time": "2020-03-30T03:43:56.171678Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Type</th>\n",
       "      <th>Output</th>\n",
       "      <th>Count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>B</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>C</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>C</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>J</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>J</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Type  Output  Count\n",
       "0    B       0      1\n",
       "1    B       1      1\n",
       "2    B       2      1\n",
       "3    C       0      2\n",
       "4    C       1      1\n",
       "5    C       2      2\n",
       "6    J       0      2\n",
       "7    J       1      1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# now calculate the chi2 between original feature `Type` and target variable `Output`\n",
    "# calculate the contingency table first\n",
    "obs_df = sample_raw.groupby(['Type','Output']).size().reset_index()\n",
    "obs_df.columns = ['Type','Output','Count']\n",
    "obs_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the contingency table is as below:\n",
    "\n",
    "| Type/Output | 0 | 1 | 2 |\n",
    "|------|------|------|------|\n",
    "|  B  | 1 | 1 | 1 |\n",
    "|  C  | 2 | 1 | 2 |\n",
    "|  J  | 2 | 1 | 0 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.194107Z",
     "start_time": "2020-03-30T03:43:56.187780Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1.7600000000000002,\n",
       " 0.779791873961373,\n",
       " 4,\n",
       " array([[1.36363636, 0.81818182, 0.81818182],\n",
       "        [2.27272727, 1.36363636, 1.36363636],\n",
       "        [1.36363636, 0.81818182, 0.81818182]]))"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from scipy.stats import chi2_contingency\n",
    "obs = np.array([[1, 1, 1], [2, 1, 2],[2, 1, 0]])\n",
    "chi2_contingency(obs) # the first value is the chi2 stats between the orginal feature\n",
    "# and the target variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.200934Z",
     "start_time": "2020-03-30T03:43:56.195581Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# the sum of the sklearn.chi2 results equal to the chi-square stats between the orginal feature\n",
    "# and the target variable\n",
    "\n",
    "chi2(sample.values[:,[1,2,3]],sample.values[:,[0]])[0].sum() == chi2_contingency(obs)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.207000Z",
     "start_time": "2020-03-30T03:43:56.202418Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Power_divergenceResult(statistic=0.17777777777777778, pvalue=0.9149472287300311)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# then how the chi2 is calcualted in sklearn?\n",
    "# take the first result 0.17777778 as an example\n",
    "# this is the chi2 value sklearn calculated for Type B boolean variable\n",
    "# this value 0.17777778 is the same as the result from the below code\n",
    "\n",
    "from scipy.stats import chisquare\n",
    "f_exp = np.array([5/11, 3/11, 3/11]) * 3 # expected occurance = the prior prob of output variable * the numebr of obs in B Type\n",
    "chisquare([1,1,1], f_exp=f_exp) # [1,1,1] is the actual occurance of obs in B Type\n",
    "\n",
    "# so the chi2 function in sklearn only considers the obs in B Type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How to use sklearn chi2 function for feature selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.305211Z",
     "start_time": "2020-03-30T03:43:56.208518Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import chi2\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "\n",
    "# load the dataset\n",
    "from sklearn.datasets import load_iris # use iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "# in this dataset, X is continuous variable, y is categorical variable\n",
    "# does not meet the chi2 requirement \n",
    "\n",
    "# convert to categorical data by converting data to booleans\n",
    "# as demo, convert continuous variable to booleans by whether the value is greater than mean\n",
    "X = X > X.mean(0)\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 obs as train_set\n",
    "# the rest 50 obs as test_set\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# use pre-built chi-squared function in sklearn\n",
    "selector = SelectKBest(chi2, k=2) # # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on trainset \n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_train.shape #(100, 2), select the 3rd and 4th features\n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]]) \n",
    "\n",
    "transformed_test = selector.transform(test_set) # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]);\n",
    "# select the 3rd and 4th features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.312820Z",
     "start_time": "2020-03-30T03:43:56.306465Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The ch2 statistics between the 1 feature and target is 29.69, p-value is 0.0\n",
      "The ch2 statistics between the 2 feature and target is 19.42, p-value is 0.0\n",
      "The ch2 statistics between the 3 feature and target is 31.97, p-value is 0.0\n",
      "The ch2 statistics between the 4 feature and target is 31.71, p-value is 0.0\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score, p_value = chi2(train_set[:,idx].reshape(-1,1), train_y)\n",
    "    print(f\"The ch2 statistics between the {idx + 1} feature and target is {round(score[0],2)}, p-value is {round(p_value[0],3)}\")\n",
    "# so we should select the 3rd and 4th features "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### F-Score (classification problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In classification tasks, if the features are categorical, then we can use the chi-square statistic to select the top features. However, if features are continuous, then we should use ANOVA F-value. The ANOVA F-value scores examine if we group the numerical feature by the target variable (category), the population means for each group are significantly different. The null hypothesis is that these mean are the same. So we should select features output higher F-Score since higher F-Score means that we should reject the hypothesis therefore features with higher F-Score are more related to the target variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "$F = \\frac{MSB}{MSE} = \\frac{ \\frac{SS(between)}{m-1}}{ \\frac{SS(error)}{n-m}}$  \n",
    "  \n",
    "Where SS(between) is the Sum of Squares between the groups, specifically the sum of squares between the group means and the grand mean. SS(error) is the Sum of Squares within the groups, specifically sum of squares between the data and the group means. m is the number of groups compared, n is the number of observations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.328141Z",
     "start_time": "2020-03-30T03:43:56.314337Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import f_classif\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# use iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "# in this dataset, X is continuous variable, y is categorical, so we can use \n",
    "# ANOVA-F score to select features\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 obs as train_set\n",
    "# the rest 50 obs as test_set\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# use pre-built f-score function in sklearn\n",
    "selector = SelectKBest(f_classif, k=2) # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on train set\n",
    "transformed_train = selector.transform(train_set) # transform train set\n",
    "transformed_train.shape #(100, 2), select the 3rd and 4th features\n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]]) # return True\n",
    "\n",
    "transformed_test = selector.transform(test_set) # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]);\n",
    "# select the 3rd and 4th features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.339420Z",
     "start_time": "2020-03-30T03:43:56.330550Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The ANOVA F-Score between the 1 feature and target is 91.39, p-value is 0.0\n",
      "The ANOVA F-Score between the 2 feature and target is 33.18, p-value is 0.0\n",
      "The ANOVA F-Score between the 3 feature and target is 733.94, p-value is 0.0\n",
      "The ANOVA F-Score between the 4 feature and target is 608.95, p-value is 0.0\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score, p_value = f_classif(train_set[:,idx].reshape(-1,1), train_y)\n",
    "    print(f\"The ANOVA F-Score between the {idx + 1} feature and target is {round(score[0],2)}, p-value is {round(p_value[0],3)}\")\n",
    "# so we should select the 3rd and 4th features "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Mutual Information (classification problem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mutual information measures the dependency between the two variables. MI is equal to zero if and only if two random variables are independent, and higher values reflect higher dependency. Compared with Pearson correlation & F-Score, it also captures the non-linear relationships."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula (the same as 1.1.1.5):  \n",
    "  \n",
    "- For discrete distributions (for both x and y):  \n",
    "  \n",
    "    $I(x, y) = \\sum_{x\\in \\mathit{X}}  \\sum_{x\\in \\mathit{Y}} \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)})$  \n",
    "\n",
    "    Where $\\textit{p}_{(X,Y)}(x,y)$ is the joint probability mass function (PMF) of x and y, $\\textit{p}_{X}(x)$ is the PMF of x.  \n",
    "  \n",
    "- For continuous distributions (for both x and y):  \n",
    "\n",
    "    $I(X, Y) = \\int_X \\int_Y  \\textit{p}_{(X,Y)}(x,y) \\textrm{log}(\\frac{\\textit{p}_{(X,Y)}(x,y)}{\\textit{p}_{X}(x)\\textit{p}_{Y}(y)}) \\, \\, dx dy$  \n",
    "    \n",
    "    Where $\\textit{p}_{(X,Y)}(x,y)$ is the joint probability density function (PDF) of x and y, $\\textit{p}_{X}(x)$ is the PDF of x.  \n",
    "  \n",
    "  \n",
    "But in reality, is it likely that one of x and y is discrete variable and the another is continuous variable. So in sklearn, it implement the nonparametric methods based on entropy estimation from k-nearest neighbors distances proposed in [1] and [2].  \n",
    "  \n",
    "  \n",
    "[1] A. Kraskov, H. Stogbauer and P. Grassberger, “Estimating mutual information”. Phys. Rev. E 69, 2004.  \n",
    "[2] B. C. Ross “Mutual Information between Discrete and Continuous Data Sets”. PLoS ONE 9(2), 2014. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.353940Z",
     "start_time": "2020-03-30T03:43:56.341244Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import mutual_info_classif\n",
    "from sklearn.feature_selection import SelectKBest\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# use iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "# in this dataset, X is continuous variable, y is categorical, so we can use \n",
    "# MI to select features\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 100 observations as test_set\n",
    "train_set = X[0:100,:].astype(float)\n",
    "test_set = X[100:,].astype(float)\n",
    "train_y = y[0:100].astype(float)\n",
    "\n",
    "# since n_neighbors in the KNN is also a very important parameters\n",
    "# so we write a new MI function function based on pre-built MI function in sklearn\n",
    "# to allow more flexibility\n",
    "\n",
    "def udf_MI(X, y):\n",
    "    result = mutual_info_classif(X, y, n_neighbors = 5) # user_defined n_neighbors\n",
    "    return result\n",
    "\n",
    "selector = SelectKBest(f_classif, k=2) # k => number of top features to select\n",
    "selector.fit(train_set, train_y) # fit on the trainset\n",
    "transformed_train = selector.transform(train_set) # transform train set\n",
    "transformed_train.shape #(100, 2), select the 3rd and 4th features\n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]]) # return True\n",
    "\n",
    "transformed_test = selector.transform(test_set) # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]);\n",
    "# so we should select the 3rd and 4th features "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.375922Z",
     "start_time": "2020-03-30T03:43:56.355721Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The MI between the 1 feature and target is 0.54\n",
      "The MI between the 2 feature and target is 0.28\n",
      "The MI between the 3 feature and target is 0.99\n",
      "The MI between the 4 feature and target is 1.01\n"
     ]
    }
   ],
   "source": [
    "# validate the result\n",
    "for idx in range(train_set.shape[1]):\n",
    "    score = mutual_info_classif(train_set[:,idx].reshape(-1,1), train_y, n_neighbors = 5)\n",
    "    print(f\"The MI between the {idx + 1} feature and target is {round(score[0],2)}\")\n",
    "# so we should select the 3rd and 4th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multivariate Filter Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared with Univariate Filter Methods, Multivariate filter methods select the best feature based on the entire feature space. The relationships between features are taken into account so it performs better in removing redundant features. Here utilize the [skfeature](https://github.com/jundongl/scikit-feature) module developed by Arizona State University."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Max-Relevance Min-Redundancy (mRMR) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mRMR method tries to find a subset of features that have a higher association (MI) with the target variable while at the same time have lower inter-association with all the features already in the subset. We did some research on the source code, we figure out that  the implementation of mRMR in skfeature only works for discrete features in the classification problems since it uses discrete Mutual Information formula during the calculation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "\n",
    "Assuming dataset contains m features, the n-th feature importance based on the mRMR criterion for feature $X_i$ can be expressed as:  \n",
    "  \n",
    "$f^{mRMR}(X_i) = I(Y, X_i) - \\frac{1}{|S|}\\sum_{X_s \\in S} I(X_s, X_i)$  \n",
    "\n",
    "$I(Y, X_i)$ is the MI between feature $X_i$ and target variable. $\\frac{1}{|S|}\\sum_{X_s \\in S} I(X_s, X_i)$ is the average MI between feature $X_i$ and all the features already in the subset.  \n",
    "\n",
    "mRMR is a step-wise method, at each step, the feature $X_i, (X_i \\notin  S)$ with the highest feature importance score $f^{mRMR}(X_i)$ will be added to the subset until reach desired number of features in the subset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.391187Z",
     "start_time": "2020-03-30T03:43:56.377775Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.information_theoretical_based import MRMR\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# load iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 observations as test_set\n",
    "# since the mRMR in skfeature only works for discrete features\n",
    "# so we will cast continuous variable to discrete variable \n",
    "# by converting float to int for demonstration purpose only.\n",
    "\n",
    "train_set = X[0:100,:].astype(int)\n",
    "test_set = X[100:,].astype(int)\n",
    "train_y = y[0:100].astype(int)\n",
    "\n",
    "feature_index,_,_ = MRMR.mrmr(train_set, train_y, n_selected_features=2) # fit on the trainset\n",
    "transformed_train = train_set[:,feature_index] # transform train set\n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]])  # select the 3rd and 4th features\n",
    "\n",
    "transformed_test = test_set[:,feature_index]  # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]) # select the 3rd and 4th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Correlation-based Feature Selection (CFS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to mRMR, CFS method is also based on a simple assumption: Good feature subsets should contain features highly correlated with the target and uncorrelated to each other. By analyzing the source code, we find that in skfeature, the CFS method is only applicable to discrete features in classification problems. Because it uses discrete symmetrical uncertainty (SU) as a measure of the correlation between variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "\n",
    "$Merit_S =\\frac{ \\sum_{i=1}^{k} SU(X_i, y)}{\\sqrt{k + \\sum_{i=1}^{k} \\sum_{j=1}^{k} SU(X_i, X_j) }}, \\, \\, \\, X_i \\in S^*$  \n",
    "  \n",
    "$SU(X, Y) = 2 * \\frac{H(X) + H(Y) - H(X|Y)}{H(X) + H(Y)}$  \n",
    "  \n",
    "$S$ is the subset. We try to find a subset $S^*$ that maximize $Merit_S$.  \n",
    "$SU(X_i, y)$ is the symmetrical uncertainty (SU) between feature $X_i$ and $y$.   \n",
    "$SU(X_i, X_j)$ is the symmetrical uncertainty (SU) between feature $X_i$ and $X_j$.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.414989Z",
     "start_time": "2020-03-30T03:43:56.395469Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.statistical_based import CFS\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# load iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 observations as test_set\n",
    "# since the CFS in skfeature only works for discrete features\n",
    "# so we will cast continuous variable to discrete variable \n",
    "# by converting float to int for demonstration purpose only.\n",
    "\n",
    "train_set = X[0:100,:].astype(int)\n",
    "test_set = X[100:,].astype(int)\n",
    "train_y = y[0:100].astype(int)\n",
    "\n",
    "num_feature = 2 # we select a subset contains two features\n",
    "feature_index = CFS.cfs(train_set, train_y) # fit on the trainset\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # transform train set\n",
    "assert np.array_equal(transformed_train, train_set[:,[3,2]])  # select the 3rd and 4th features\n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]]   # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[3,2]]) # select the 3rd and 4th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-01-16T21:15:44.699047Z",
     "start_time": "2020-01-16T21:15:44.694402Z"
    }
   },
   "source": [
    "#### Fast Correlation-based Filter (FCBF)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared with CFS, FCBS can filter and select variables more efficiently. FCBF is also a step-wise method similar to mRMR, but FCBS uses symmetric uncertainty (SU) to measure the correlation between variables. FCBF firstly removes variables that have lower SU values with the target variable. Secondly, it sorts the remaining variables based on their SU value with the target variable from highest to lowest. Then it deletes redundant features one by one. Similar to mRMR, CFS, the FCBF implemented in skfeature is only suitable for classification problems with discrete variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "$SU(X, Y) = 2 * \\frac{H(X) + H(Y) - H(X|Y)}{H(X) + H(Y)}$  \n",
    "  \n",
    "Steps:  \n",
    "1) Calculate the SU value $SU(X_i, y)$ between each feature $X_i$ and the target variable $y$.  \n",
    "2) Delete the features that has $ SU(X_i, y)$ lower than a certain threshold $\\sigma$, create the candidate list $S_ {list}$ with the remaining features.  \n",
    "3) Sort the variables in candiate list $S_{list}$ based on their SU value $SU(X_i,y)$ with the target variable from highest to lowest.  \n",
    "4) Following the order in the $S_{list}$, for each feature in the candiate list $S_{list}$,  calculate the SU value between feature $X_i$ and feature $X_j$ if $SU(X_i, y)$ is greater than $SU(X_j, y)$.  \n",
    "5) If $SU (X_i,X_j)$ is greater than $ SU(X_j,y)$, then delete feature $X_j$ from the candidate list $S_ {list}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.431245Z",
     "start_time": "2020-03-30T03:43:56.417234Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.information_theoretical_based import FCBF\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# load iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 observations as test_set\n",
    "# since the FCFS in skfeature only works for discrete features\n",
    "# so we will cast continuous variable to discrete variable \n",
    "# by converting float to int for demonstration purpose only.\n",
    "\n",
    "train_set = X[0:100,:].astype(int)\n",
    "test_set = X[100:,].astype(int)\n",
    "train_y = y[0:100].astype(int)\n",
    "\n",
    "num_feature = 2 # we select a subset contains two features\n",
    "feature_index = FCBF.fcbf(train_set, train_y, n_selected_features=num_feature)[0] # fit on the trainset\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # transform train set\n",
    "assert np.array_equal(transformed_train, train_set[:,[3, 2]])  # only the 3rd, 4th feature is selected!\n",
    "# because other variables have low SU value with the target variable y\n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]]   # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[3, 2]]) # only the 3rd, 4th feature is selected!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### ReliefF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Relief F method is based on the Relief method. The Relief method is a Feature weighting algorithm. It assigns features with higher weights if they have a higher correlation with the target variable (binary classification), and removes features with weights below a certain threshold. The correlation in the Relief method is defined as the ability to discriminate close observations.\n",
    "\n",
    "In each step, the algorithm randomly selects an observation S from the training set, and then it finds the nearest neighbor observation of S that has the same target label, called it NearHit. It will also find the nearest neighbor observation of S that has a different label, called it NearMiss. It then updates the weight of each feature according to the following rules:\n",
    "\n",
    "1) If the distance between observations S and Near Hit on a feature is less than the distance between observations S and Near Miss, the weight of the feature will be increased since the feature is beneficial for discriminating labels in its nearest neighbors.\n",
    "\n",
    "2) Conversely, if the distance between observations S and Near Hit on a feature is greater than the distance on R and Near Miss, the weight of the feature is reduced.\n",
    "\n",
    "The above process is repeated over m times, and finally, the average weight of each feature is obtained. The larger the weight of a feature, the stronger the classification ability of the feature.\n",
    "\n",
    "In Relief F, it modified the way to update the weights so it can be applied to the multi-class classification problem. Also, it randomly samples K nearest observations instead of one.\n",
    "\n",
    "ReliefF implemented in skfeature works for continuous features or binary categorical features in classification problems since it uses the L1 norm to measure the differences. For non-binary categorical features, we can first encode it one-hot encoding and then apply the ReliefF method for feature selection."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "$W(X_i) = W(X_i) - \\frac{\\sum_{i=1}^{k} diff(X_i, S, H_j) }{m*k} + \\frac{\\sum_{C\\notin class(S)} [\\frac{p(C)}{1-P(class(S))} \\sum_{i=1}^{k} diff(X_i, S, M_j(C)) ]}{m*k}$  \n",
    "\n",
    "$diff(X_i, R_1, R_2) = \\left\\{\\begin{matrix}\n",
    "\\frac{|R_1(X_i) - R_2(X_i)|}{max(X_i)- min(X_i)} & if X_i\\ is\\ continuous\\\\ \n",
    " 0 & if X_i\\ is\\ discrete\\ and \\ R_1(X_i) =R_2(X_i)\\\\ \n",
    " 1 & if X_i\\ is\\ discrete\\ and \\ R_1(X_i) \\neq R_2(X_i)\n",
    "\\end{matrix}\\right.$  \n",
    "  \n",
    "$R_1$ and $R_2$ are two observations. $X_i$ is the feature we are working on. S is the observation we selected. $H_j$ is the j-th NearHit, $M_j$ is the j-th NearMiss. C are the classes that are different from the class of observation we selected. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.468383Z",
     "start_time": "2020-03-30T03:43:56.433395Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.similarity_based import reliefF\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# load iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 observations as test_set\n",
    "# The reliefF in skfeature directly support continuous features\n",
    "\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "num_feature = 2 # we select a subset contains two features\n",
    "score = reliefF.reliefF(train_set, train_y) # calculate the weight for each feature\n",
    "feature_index = reliefF.feature_ranking(score) # rank index based on scores\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # transform train set\n",
    "assert np.array_equal(transformed_train, train_set[:,[2,3]])  # select the 3rd and 4th features\n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]]   # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[2,3]]) # select the 3rd and 4th features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-01-16T21:48:57.048462Z",
     "start_time": "2020-01-16T21:48:57.043984Z"
    }
   },
   "source": [
    "#### Spectral Feature Selection (SPEC)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "SPEC method is an unsupervised method built on spectral graph theory. It firstly builds up the similarity set S and constructing its graph representation. Then it evaluates features based on the spectrum of the constructed graph. The SPEC implemented in skfeature works for continuous features or binary categorical features in classification problems since it uses the RBF (Gaussian) kernel as the similarity set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:56.487803Z",
     "start_time": "2020-03-30T03:43:56.469842Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from skfeature.function.similarity_based import SPEC\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# load iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# before using iris dataset, we need to shuffle the dataset first\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 observations as test_set\n",
    "# The SPEC method in skfeature directly support continuous features\n",
    "\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "num_feature = 2 # we select a subset contains two features\n",
    "score = SPEC.spec(train_set) # calculate the scores for each feature\n",
    "feature_index = SPEC.feature_ranking(score)  # rank index based on scores\n",
    "transformed_train = train_set[:,feature_index[0:num_feature]] # transform train set\n",
    "assert np.array_equal(transformed_train, train_set[:,[1,0]])  # select the 1st and 2nd features\n",
    "\n",
    "transformed_test = test_set[:,feature_index[0:num_feature]]   # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[1,0]]) # select the 1st and 2nd features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrapper Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wrapper methods take the feature selection problem as a search problem, that is, it tries to find an optimal feature subset that performs the best in the supervised ML model. In each step, it trains the model on one feature subset and then evaluates it. In the next step, it adjusts the feature subset, retrains the model and evaluates the subset, so on and so forth. The exhaustive search is an NP-Hard problem, so people have proposed some methods to reduce the number of iterations required for the wrapper method so that a good feature subset can be found out in a limited time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-02-01T03:17:41.041775Z",
     "start_time": "2020-02-01T03:17:40.925277Z"
    }
   },
   "source": [
    "![image](./images/Wrapper_Pipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Deterministic Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deterministic Algorithms will always output the same subset of features given the same data input. Sequential Forward Selection (SFS), Sequential Backward Selection (SBS) are examples of the Deterministic Algorithm.\n",
    "\n",
    "SFS starts from a model fitting on one single feature, and on every step, it adds one new feature into the existing feature subset that outputs the maximum increase in model performance. The iteration stops when the number of selected features meets the requirement.\n",
    "\n",
    "Whereas SBS starts from a model fitting on all features, and on every step, it deletes the lease important feature from the subset. The iteration stops when the number of selected features meets the requirement.\n",
    "\n",
    "But both SFS & SBS are step-wise methods and are likely to get stuck at local optimal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Recursive Feature Elimination (SBS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In sklearn, it implements Recursive Feature Elimination (SBS) only. Sklearn provides two Recursive Feature Elimination functions, one is RFE and the other is RFECV. Compared with RFE function, the REFCV function uses cross-validated results to select the best number of features, whereas in RFE, the number of features to select is predefined by users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:43:59.461478Z",
     "start_time": "2020-03-30T03:43:56.489773Z"
    }
   },
   "outputs": [],
   "source": [
    "# RFE\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import RFE\n",
    "\n",
    "# load dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observations as train_set\n",
    "# the rest as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# define a predictive model to evaluate feature subsets\n",
    "from sklearn.ensemble import ExtraTreesRegressor # we use extratree model\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = RFE(estimator = clf, n_features_to_select = 4, step = 1) \n",
    "# no cv in RFE, we need to specify the number of features to select\n",
    "# select 4 features, each step we only remove one feature\n",
    "\n",
    "selector = selector.fit(train_set, train_y) # fit on train set\n",
    "\n",
    "transformed_train = train_set[:,selector.support_] # transform trainset\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,5,6,7]]) # select the 1st, 6th, 7th, 8th features\n",
    "\n",
    "transformed_test = test_set[:,selector.support_] # transform test set\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,5,6,7]]) # select the 1st, 6th, 7th, 8th features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:44:17.248861Z",
     "start_time": "2020-03-30T03:43:59.462942Z"
    }
   },
   "outputs": [],
   "source": [
    "# RFECV\n",
    "from sklearn.feature_selection import RFECV\n",
    "\n",
    "# load dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observations as train_set\n",
    "# the rest as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# define a predictive model to evaluate feature subsets\n",
    "from sklearn.ensemble import ExtraTreesRegressor # we use extratree model\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = RFECV(estimator = clf, step = 1, cv = 5) # use 5-fold cross validation\n",
    "# each step we only remove one feature\n",
    "\n",
    "selector = selector.fit(train_set, train_y) # fit on train set\n",
    "\n",
    "transformed_train = train_set[:,selector.support_] # transform trainset\n",
    "assert np.array_equal(transformed_train, train_set) # select all the features\n",
    "\n",
    "transformed_test = test_set[:,selector.support_] # transform test set\n",
    "assert np.array_equal(transformed_test, test_set) # select all the features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-01-17T16:49:13.141829Z",
     "start_time": "2020-01-17T16:49:13.137293Z"
    }
   },
   "source": [
    "### Randomized Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared with the Deterministic Algorithm, the Randomized Algorithms employs a degree of randomness when searching the best subset. So it might output different subsets of features given the same data input but the randomness will help the model avoid stuck at the local optimal results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Simulated Annealing (SA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simulated annealing is a controlled random search method. Each time, we will select a feature subset at random based on the current solution. If the new subset works better, then we will adopt it. If the new subset works worse, we will still accept it but at some probability determined by the current state.\n",
    "\n",
    "Accepting a worse solution is crucial in the SA algorithm because this helps to avoid stuck at local optimal. As the iteration going, the SA algorithm should reach and converge to a good and stable solution.\n",
    "\n",
    "Since currently there aren't any packages implemented the SA algorithm well, so I wrote a [python script](./SA.py) to implement the SA algorithm for your reference. It supports both classification and regression problems. It also provides cross-validation support."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Formula:  \n",
    "  \n",
    "The probability of accepting a worse solution is as below:  \n",
    "$Prob = exp( - \\frac{loss_{n} - loss_{o}}{k * Cur\\_{Temperature}})$  \n",
    "  \n",
    "Where $loss_n$ is the new loss, $loss_o$ is the lowest score achieved before fitting the new model, $Cur\\_{Temperature}$ is the current temperature. \n",
    "  \n",
    "The pseudo code of the SA algorithm:  \n",
    "\n",
    "![image](./images/SA_Pseudo_Code.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regression example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:21.155811Z",
     "start_time": "2020-03-30T03:44:17.250889Z"
    }
   },
   "outputs": [],
   "source": [
    "from SA import Simulated_Annealing # import the python script with SA.\n",
    "\n",
    "# regression example\n",
    "# load the dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observations as train_set\n",
    "# the rest observations as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# define a predictive model\n",
    "from sklearn.ensemble import ExtraTreesRegressor # we use extratree as predictive model\n",
    "\n",
    "# define the loss function in SA algorithm\n",
    "from sklearn.metrics import mean_squared_error # for regression task we use MSE\n",
    "\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = Simulated_Annealing(loss_func = mean_squared_error, estimator = clf, \n",
    "                               init_temp = 0.2, min_temp = 0.005, iteration = 10, alpha = 0.9)\n",
    "# fit on trainset\n",
    "# parameters detail can be viewed from SA.py\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, cv = 5) # use 5-fold cross-validation\n",
    "\n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_test = selector.transform(test_set)  # transform test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:21.159895Z",
     "start_time": "2020-03-30T03:45:21.157307Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # return the best solution feature index;\n",
    "selector.best_loss; # return the loss associated with the best solution; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Classification example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:21.478476Z",
     "start_time": "2020-03-30T03:45:21.162042Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "from SA import Simulated_Annealing # import the python script with SA.\n",
    "\n",
    "# classification example\n",
    "# use iris dataset\n",
    "from sklearn.datasets import load_iris\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# random suffle the dataset\n",
    "# use the first 100 observations as train_set\n",
    "# the rest top 20 observations as validation set\n",
    "# the rest 30 observations as test set\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "train_set = X[0:100,:]\n",
    "val_set = X[100:120,:]\n",
    "test_set = X[120:,:]\n",
    "\n",
    "train_y = y[0:100]\n",
    "val_y = y[100:120]\n",
    "test_y = y[120:]\n",
    "\n",
    "# reset random seed! \n",
    "# Both SA & GA need to introduce randomness\n",
    "random.seed()\n",
    "np.random.seed()\n",
    "\n",
    "# define a predictive model\n",
    "from sklearn.ensemble import ExtraTreesClassifier # we use extratree as predictive model\n",
    "\n",
    "# define the loss function in SA algorithm\n",
    "from sklearn.metrics import log_loss # for classification we use entropy\n",
    "\n",
    "clf = ExtraTreesClassifier(n_estimators=25)\n",
    "selector = Simulated_Annealing(loss_func = log_loss, estimator = clf, \n",
    "                               init_temp = 0.2, min_temp = 0.005, iteration = 10, \n",
    "                               alpha = 0.9, predict_type = 'predict_proba')\n",
    "# fit on trainset\n",
    "# parameters detail can be viewed from SA.py\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, X_val = val_set, \n",
    "             y_val = val_y, stop_point = 10)\n",
    "# the pipeline also allows user to specify validation set for feature selection\n",
    "# here we will give it a try\n",
    "\n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_test = selector.transform(test_set)  # transform test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:45:21.482118Z",
     "start_time": "2020-03-30T03:45:21.479972Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # return the best solution feature index;\n",
    "selector.best_loss; # return the loss associated with the best solution; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Genetic Algorithm (GA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The genetic algorithm is a stochastic optimization tool based on concepts of evolution population biology. It mimics the evolutionary process in nature and solves the optimization problem by allowing the solutions to reproduce and create new solutions (generations) by \"crossover\" and \"mutation\". It also incorporates the competition concept and only allows the fittest solutions (in our cases, feature subsets that result in the lowest loss) to survive and populate their subsequent generation. GA algorithm should converge to an optimization solution after a certain generation evolution.\n",
    "\n",
    "I also wrote a [python script](GA.py) to implement the GA algorithm for your reference. It provides two algorithms including 'one-max' and 'NSGA2'. 'One-max' is the traditional one objective GA algorithm and 'NSGA2' is a multi-objective GA algorithm. In our case, the target of 'one-max' is to reduce the loss, while the target of 'NSGA2' is to minimize both the loss and also the number of features in the subset. \n",
    "  \n",
    "The python script supports both classification and regression problems. It also provides cross-validation support."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pseudo code of the GA algorithm:  \n",
    "\n",
    "![image](./images/GA_Pseudo_Code.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regression example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.363842Z",
     "start_time": "2020-03-30T03:45:21.483715Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [01:09<00:00,  7.00s/it]\n"
     ]
    }
   ],
   "source": [
    "from GA import Genetic_Algorithm # import the python script with SA.\n",
    "\n",
    "# regression example\n",
    "# load the dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observations as train_set\n",
    "# the rest observations as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# define a predictive model\n",
    "from sklearn.ensemble import ExtraTreesRegressor # we use extratree as predictive model\n",
    "\n",
    "# define the loss function in SA algorithm\n",
    "from sklearn.metrics import mean_squared_error # for regression task we use MSE\n",
    "\n",
    "clf = ExtraTreesRegressor(n_estimators=25)\n",
    "selector = Genetic_Algorithm(loss_func = mean_squared_error, estimator = clf, \n",
    "                             n_gen = 10, n_pop = 20, algorithm = 'NSGA2')\n",
    "# fit on trainset\n",
    "# parameters detail can be viewed from GA.py\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, cv = 5) # use 5-fold cross-validation\n",
    "\n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_test = selector.transform(test_set)  # transform test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.368582Z",
     "start_time": "2020-03-30T03:47:02.365962Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # return the best solution feature index;\n",
    "selector.best_loss; # return the loss associated with the best solution; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Classification example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.615923Z",
     "start_time": "2020-03-30T03:47:02.370729Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5/5 [00:00<00:00, 70.78it/s]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from GA import Genetic_Algorithm # import the python script with GA.\n",
    "\n",
    "# classification example\n",
    "# use iris dataset\n",
    "from sklearn.datasets import load_iris\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# random suffle the dataset\n",
    "# use the first 100 observations as train_set\n",
    "# the rest top 20 observations as validation set\n",
    "# the rest 30 observations as test set\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "train_set = X[0:100,:]\n",
    "val_set = X[100:120,:]\n",
    "test_set = X[120:,:]\n",
    "\n",
    "train_y = y[0:100]\n",
    "val_y = y[100:120]\n",
    "test_y = y[120:]\n",
    "\n",
    "# reset random seed! \n",
    "# Both SA & GA need to introduce randomness\n",
    "random.seed()\n",
    "np.random.seed()\n",
    "\n",
    "# define a predictive model\n",
    "from sklearn.ensemble import ExtraTreesClassifier # we use extratree as predictive model\n",
    "\n",
    "# define the loss function in SA algorithm\n",
    "from sklearn.metrics import log_loss # for classification we use entropy\n",
    "\n",
    "clf = ExtraTreesClassifier(n_estimators=25)\n",
    "selector = Genetic_Algorithm(loss_func = log_loss, estimator = clf, \n",
    "                             n_gen = 5, n_pop = 10, predict_type = 'predict_proba')\n",
    "# parameters detail can be viewed from GA.py\n",
    "\n",
    "selector.fit(X_train = train_set, y_train = train_y, X_val = val_set, \n",
    "             y_val = val_y)\n",
    "# the pipeline also allows user to specify validation set for feature selection\n",
    "# here we will give it a try\n",
    "\n",
    "transformed_train = selector.transform(train_set) # transform trainset\n",
    "transformed_test = selector.transform(test_set)  # transform test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.620523Z",
     "start_time": "2020-03-30T03:47:02.617727Z"
    }
   },
   "outputs": [],
   "source": [
    "selector.best_sol # return the best solution feature index;\n",
    "selector.best_loss; # return the loss associated with the best solution; "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Embedded Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-01-25T21:07:27.865808Z",
     "start_time": "2020-01-25T21:07:27.860826Z"
    }
   },
   "source": [
    "The selection process of Filter Methods is independent of the ML models, so Filter Methods might select features that are less important in the ML models and might lead to poor model performance.\n",
    "\n",
    "Wrapper Methods utilize the predefined ML models to select the best features. But since they need to train models many times on a large number of possible subsets, they require long processing time although they usually lead to better performance than the Filter Methods.\n",
    "\n",
    "Embedded Methods embeds the feature selection process inside the ML models. They learn the best feature subset while the model is being created. So compared with the Filter Methods, they tend to have better performance. Compared with the Wrapper Methods, they save large processing time and computing power."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Comparision between three approaches**.  \n",
    "   \n",
    "|Aspects | Filter Methods | Wrapper Methods\t| Embedded Methods\n",
    "|--|--|--|--|\n",
    "|Reliance on Model| No | Yes | Yes |\n",
    "|Requirement on Cross Validation |\tMaybe (Can use CV to select the number of features) | Yes | Maybe (Can use CV to select the number of features)\n",
    "|Process Time |\tShort | Long | Medium\n",
    "|Restriction on the ML models|\tNo | No | Yes (linear models with L1 or L2 norm or tree-based models)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![image.png](./images/Embedded_Pipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regulization Based Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many ML models introduce penalties (L1 norm or L2 norm) in their loss functions to prevent the overfitting problem. The L1 norm penalization in linear models (such as Linear SVC, Logistic Regression, Linear Regression) tends to shrink the feature coefficients of some features to zero therefore results in sparse solutions. So we can assign scores to each feature by their coefficients in the linear model with regularization. The higher the coefficients, the more important the features in a linear model.\n",
    "\n",
    "We can use sklearn SelectFromModel function to remove features that have low or zero feature coefficients."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Lasso Regression (Linear Regression with L1 Norm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.849021Z",
     "start_time": "2020-03-30T03:47:02.622378Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.346,  0.003, -0.   , -0.   , -0.   , -0.   , -0.033,  0.   ])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.linear_model import Lasso # we can also use Ridge regression with L2 norm \n",
    "\n",
    "# regression example\n",
    "# load the dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observations as train_set\n",
    "# the rest observations as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "clf = Lasso(normalize=True, alpha = 0.001)  # we need to normalize the data first or \n",
    "# the coefficient is meanless and uncomparable\n",
    "# if alpha is set too high, then the L1 norm will shrink every coefficients to 0\n",
    "# the bigger alpha, the stronger the penalty\n",
    "\n",
    "clf.fit(train_set, train_y)\n",
    "np.round(clf.coef_ ,3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.855973Z",
     "start_time": "2020-03-30T03:47:02.850885Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold=1e-5)\n",
    "# threshold is set to 1e-5, so the features with absolute coefficients below 1e-5 \n",
    "# will be removed\n",
    "# we can also set the max_features parameters to select the top features\n",
    "\n",
    "transformed_train = selector.transform(train_set)\n",
    "transformed_test = selector.transform(test_set)\n",
    "\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,1,6]]) \n",
    "# select the 1st, 2nd, 7th features\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,1,6]]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Logistic Regression (with L1 Norm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.873760Z",
     "start_time": "2020-03-30T03:47:02.857707Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.   ,  1.   , -3.452, -0.159],\n",
       "       [ 0.   , -1.201,  0.053,  0.   ],\n",
       "       [ 0.   ,  0.   ,  1.331,  3.27 ]])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.datasets import load_iris # use iris dataset as example\n",
    "\n",
    "# use iris dataset\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "# random suffle the dataset\n",
    "np.random.seed(1234)\n",
    "idx = np.random.permutation(len(X))\n",
    "X = X[idx]\n",
    "y = y[idx]\n",
    "\n",
    "# use the first 100 observations as train_set\n",
    "# the rest 50 observations as test_set\n",
    "train_set = X[0:100,:]\n",
    "test_set = X[100:,]\n",
    "train_y = y[0:100]\n",
    "\n",
    "# we need to standardize the data first or the coefficient is useless and uncomparable\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "model = StandardScaler()\n",
    "model.fit(train_set) \n",
    "standardized_train = model.transform(train_set)\n",
    "standardized_test = model.transform(test_set)\n",
    "\n",
    "clf = LogisticRegression(penalty='l1', C = 0.7, \n",
    "                         random_state=1234, solver='liblinear') \n",
    "# can set the penalty to 'l2'\n",
    "# if C is set too low, then the L1 norm will shrink every coefficients to 0\n",
    "# the bigger C, the weaker the penalty\n",
    "\n",
    "clf.fit(standardized_train, train_y)\n",
    "np.round(clf.coef_,3) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.880258Z",
     "start_time": "2020-03-30T03:47:02.875670Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold=1e-5)\n",
    "# threshold is set to 1e-5, so the features with absolut coefficients below 1e-5 \n",
    "# will be removed\n",
    "# we can also set the max_features parameters to select the top features\n",
    "\n",
    "transformed_train = selector.transform(train_set)\n",
    "transformed_test = selector.transform(test_set)\n",
    "\n",
    "assert np.array_equal(transformed_train, train_set[:,[1,2,3]]) \n",
    "# select the 2nd, 3rd, 4th features\n",
    "assert np.array_equal(transformed_test, test_set[:,[1,2,3]]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### LinearSVR/ LinearSVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.906992Z",
     "start_time": "2020-03-30T03:47:02.882152Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.254,  0.026,  0.026, -0.017,  0.032, -0.01 , -0.1  , -0.037])"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we can use LinearSVC for classification\n",
    "# Or LinearSVR for regression\n",
    "\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.svm import LinearSVR\n",
    "\n",
    "# regression example\n",
    "# load the dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observsations as train_set\n",
    "# the rest observsations as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# we need to standardize the data first or the coefficient is useless and uncomparable\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "model = StandardScaler()\n",
    "model.fit(train_set) \n",
    "standardized_train = model.transform(train_set)\n",
    "standardized_test = model.transform(test_set)\n",
    "\n",
    "clf = LinearSVR(C = 0.0001, random_state = 123) \n",
    "# the bigger C, the weaker the penalty\n",
    "\n",
    "clf.fit(standardized_train, train_y)\n",
    "np.round(clf.coef_,3) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:02.914941Z",
     "start_time": "2020-03-30T03:47:02.908595Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold=1e-2)\n",
    "# threshold is set to 1e-2, so the features with absolut coefficients below 1e-2\n",
    "# will be removed\n",
    "# we can also set the max_features parameters to select the top features\n",
    "\n",
    "transformed_train = selector.transform(train_set)\n",
    "transformed_test = selector.transform(test_set)\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,1,2,3,4,6,7]]) \n",
    "# only remove the 6th feature\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,1,2,3,4,6,7]]) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tree Based Methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many powerful tree-based ML models such as random forest, AdaBoost, Xgboost, etc. You can check out more introduction about these tree-based ML models in a series of blogs written by my friend and me [here](https://github.com/YC-Coder-Chen/Tree-Math).  \n",
    "  \n",
    "These non-parametric models record how each feature is used to reduce the loss by splitting the tree nodes and can report the feature importances of each feature based on the above records. The feature importances can be used to discard irrelevant features. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:06.533338Z",
     "start_time": "2020-03-30T03:47:02.916501Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.52 , 0.045, 0.031, 0.026, 0.027, 0.139, 0.106, 0.107])"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we use the random forest regressor model as the example\n",
    "import numpy as np\n",
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "\n",
    "# load the dataset\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "dataset = fetch_california_housing()\n",
    "X, y = dataset.data, dataset.target # use california_housing dataset as example\n",
    "\n",
    "# use the first 15000 observations as train_set\n",
    "# the rest observations as test_set\n",
    "train_set = X[0:15000,:]\n",
    "test_set = X[15000:,]\n",
    "train_y = y[0:15000]\n",
    "\n",
    "# we don't need to normaliz the dataset in tree-based feature selection\n",
    "\n",
    "clf = RandomForestRegressor(n_estimators = 50, random_state = 123)\n",
    "\n",
    "clf.fit(train_set, train_y)\n",
    "np.round(clf.feature_importances_, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:06.866836Z",
     "start_time": "2020-03-30T03:47:06.535257Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAu8AAALJCAYAAAAEds16AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzs3Xm85mdB3/3vlwxLJlDiBI0sgREM2LAkQKIgi1EwFqiAgg+JKIYHRXAJPjoutfoUsX2qhUcqxdYGishShgJCg4KJoICALJOQzbBUkiiCZZFFAhEMXP3j/EYO40zmzHrPlbzfr9e8zn3/1uu+f69JPuc6v/tMxxgBAACOfDdZ9QAAAICNEe8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8Ah1Dbq9te2/aadX9ud4DHPL3tXx+sMW7wnC9s+28P5zn3pO3T275k1eMAWAXxDnDoffcY45br/nxklYNpu2mV5z8QM48d4GAQ7wAr0vZ+bd/e9tNtL2l7+rp1T2z73rafbXtl2x9dlh+T5PVJbrd+Jn/XmfFdZ+eXnwD8fNtLk3yu7aZlv1e1/Xjbq9qes8Fxb207ljF+qO2n2j6l7WltL11ez3PXbX9227e1/U9tP9P2fW0fsm797dqe1/aTbf+i7Y+sW/f0tq9s+5K2f5fkKUl+Mcnjltd+yfW9X+vfi7Y/0/Zjbf+m7RPXrT+67f/f9i+X8b217dEbuEZnL+f67PL+PX4j7x/AgTCDAbACbW+f5A+S/GCSP0zykCSvavtNY4yPJ/lYkn+Z5MokD07y+rbvHmNc1PZhSV4yxrjDuuNt5LRnJXlEkk8k+XKS1yb5n8vyOyR5Q9v3jzHO3+DL+JYkJy7jO295HQ9NctMk72n7ijHGm9dt+8okt0nyvUl+r+03jDE+meRlSf48ye2SfFOSP2p75Rjjjcu+j0ryfUmekOTmyzG+cYzxA+vGssf3a1n/9UluneT2Sb4zySvbvmaM8akkz0py9yTfmuR/L2P98vVdoySfT/KcJKeNMd7f9rZJtmzwfQPYb2beAQ691ywzt59u+5pl2Q8ked0Y43VjjC+PMf4oyY4kD0+SMcYfjDE+ONa8OckFSR50gON4zhjjQ2OMa5OcluRrxxjPGGN8cYxxZZLnJTlzH473q2OMvx9jXJDkc0leNsb42Bjjw0n+NMm91237sST/cYzxD2OMlyd5f5JHtD0hyQOT/PxyrIuTPD9rwbzTn40xXrO8T9fubiAbeL/+IckzlvO/Lsk1Se7W9iZJ/u8kTxtjfHiM8aUxxtvHGF/IXq5R1r4Bukfbo8cYfzPG+PN9eO8A9ot4Bzj0Hj3GOHb58+hl2Z2SfN+6qP901iL2tknS9mFt37HcSvLprAXjbQ5wHB9a9/hOWbv1Zv35fzHJ8ftwvI+ue3ztbp7fct3zD48xxrrnf5m1mfbbJfnkGOOzu6y7/R7GvVsbeL/+doxx3brnn1/Gd5skt0jywd0cdo/XaIzxuSSPy9ptPH/T9g+WGXmAQ0q8A6zGh5K8eF3UHzvGOGaM8Wttb57kVVm7neP4McaxSV6XZOe9MWM3x/tcks3rnn/9brZZv9+Hkly1y/lvNcZ4+G72Oxhu36++t+eOST6y/NnS9la7rPvwHsb9T55v4P26Pp9I8vdJ7rKbdXu8Rkkyxjh/jPGdWfuG631Z+8kFwCEl3gFW4yVJvrvtd7U9qu0tlg9W3iHJzbJ2b/fHk1y33ON+xrp9P5rkuLa3Xrfs4iQPb7ul7dcn+am9nP9dSf5u+RDr0csY7tH2tIP2Cr/a1yU5p+1N235fkn+etVtSPpTk7Un+/fIe3CvJk5K89HqO9dEkW5dbXpK9v197NMb4cpIXJPmN5YOzR7W9//INwR6vUdvj2z6yax8g/kLWbsP50j6+JwD7TLwDrMASrY/K2q0qH8/aLO/PJrnJcgvJOUn+R5JPJfn+rH0gdOe+78vahzyvXG7nuF2SFye5JMnVWbvf++V7Of+Xknx3klOSXJW1GejnZ+1DnYfCO7P24dZPJPl3SR47xvjbZd1ZSbZmbRb+1Un+zXJ/+Z68Yvn6t20v2tv7tQHbklyW5N1JPpnk17N2HfZ4jZY/P7OM+ZNJvi3Jj+3DOQH2S7/6FkQAOLjanp3kh8cYD1z1WABmZ+YdAAAmId4BAGASbpsBAIBJmHkHAIBJbFr1AI5kt7nNbcbWrVtXPQwAAG7gLrzwwk+MMb52b9uJ9+uxdevW7NixY9XDAADgBq7tX25kO7fNAADAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExi06oHcCS74qNX5uRnnbnqYQAAcIhdsm37qoewIWbeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmsbJ4bzvavnjd801tP9729/fxOG9qe+ry+Oq2tznYYwUAgCPBKmfeP5fkHm2PXp5/Z5IPr3A8AABwRFv1bTOvT/KI5fFZSV62c0XbY9q+oO27276n7aOW5Ue33d720rYvT3L0rgdtu7Xte9s+r+2ft71g5zcJbb+x7RvaXtL2orZ3OfQvEwAADtyq4317kjPb3iLJvZK8c926f53kj8cYpyX59iTPbHtMkqcm+fwY415J/l2S++7h2Ccm+a0xxt2TfDrJY5blL12Wn5zkW5P8zUF+TQAAcEhsWuXJxxiXtt2atVn31+2y+owkj2y7bXl+iyR3TPLgJM9Zt/+lezj8VWOMi5fHFybZ2vZWSW4/xnj1sv/f77pT2ycneXKS3PTYzfv5ygAA4OBbabwvzkvyrCSnJzlu3fImecwY4/3rN26bJGMDx/3CusdfytrtNd3bTmOMc5OcmySbT9iykfMAAMBhserbZpLkBUmeMca4bJfl5yf5yS613vbey/K3JHn8suweWbvdZkPGGH+X5K/bPnrZ/+ZtTa8DADCFlcf7GOOvxxi/uZtVv5rkpkkubXv58jxJ/kuSWy63y/xcknft4yl/MMk5y/5vT/L1+zdyAAA4vDqGO0P2ZPMJW8aJTztj1cMAAOAQu2Tb9pWev+2FY4xT97bdymfeAQCAjRHvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAk9i06gEcyU46/s7ZsW37qocBAABJzLwDAMA0xDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExi06oHcCS74qNX5uRnnbnqYcD1umTb9lUPAQA4TMy8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABM4oDjve33tB1tv+kAjnH3tn/c9gNt/1fbX27bAx0bAADckByMmfezkrw1yZn7s3Pbo5Ocl+TXxhh3TXJykm9N8mMHYWwAAHCDcUDx3vaWSR6Q5ElZ4r3ty9s+fN02L2z7mLZHtX1m23e3vbTtjy6bfH+St40xLkiSMcbnk/xEkl/YeY62v9P2smW/xyzL/0Xbi9pe0vaNy7Knt9227tyXt926/Hlf299djvHKtpsP5LUDAMDhdqAz749O8odjjA8k+WTb+yTZnuRxSdL2ZkkekuR1WQv8z4wxTktyWpIfafsNSe6e5ML1Bx1jfDDJLdv+syS/vOx3zzHGvZL8cduvTfK8JI8ZY5yc5Ps2MNa7JTl3Ocbfxcw+AACTOdB4PytrsZ7l61lJXp/kO9rePMnDkrxljHFtkjOSPKHtxUnemeS4JCcmaZKxh+OPJA9N8lv/uGCMTyW533Lcq5Zln9zAWD80xnjb8vglSR64u43aPrntjrY7rrvmCxs4LAAAHB6b9nfHtscl+Y4k92g7khyVtdj+uSRvSvJdWZuBf9nOXZL85Bjj/F2Oc8ckD95l2Z2TXDPG+OzywdVd435PwX9dvvobkluse7zr9rv9hmGMcW6Sc5Nk8wlb9vRNBQAAHHYHMvP+2CQvGmPcaYyxdYxxQpKrsjajvT3JE5M8KMnOWD8/yVPb3jRJ2t617TFJXprkgW0fuiw/OslzkvyHZb8LsnYPfJb1X5Pkz5J823LbTdpuWVZfneQ+y7L7JPmGdeO9Y9v7L493fsgWAACmcSDxflaSV++y7FVZ+wDqBVmbTX/DGOOLy7rnJ7kiyUVtL0/yX5NsWm6peVSSX2r7/iSXJXl3kucu+/3bJF+zfPj0kiTfPsb4eJInJ/m9ZdnL151/y3JrzlOTfGDd2N6b5IfaXppkS5L/cgCvHQAADruOccO/M6Tt1iS/P8a4x77st/mELePEp51xSMYEB8sl27bvfSMA4IjW9sIxxql7286/sAoAAJPY7w+szmSMcXWSfZp1BwCAI42ZdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJjEplUP4Eh20vF3zo5t21c9DAAASGLmHQAApiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEptWPYAj2RUfvTInP+vMVQ+DI8Al27aveggAAGbeAQBgFuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmccDx3vaagzGQ6zn+89uetDz+xf3Yf2vbyw/+yAAA4PA64mfexxg/PMa4Ynm6z/EOAAA3FIck3tveqe0b2166fL3jsvyFbZ/T9u1tr2z72GX5Tdr+57Z/3vb3275u3bo3tT217a8lObrtxW1fuuuMetttbZ++PL5v20va/lmSH1+3zVFtn9n23cvYfvRQvH4AADgUDtXM+3OTvGiMca8kL03ynHXrbpvkgUn+ZZJfW5Z9b5KtSe6Z5IeT3H/XA44xfiHJtWOMU8YYj9/L+X8nyTljjF2P86QknxljnJbktCQ/0vYb9uWFAQDAqhyqeL9/kv++PH5x1mJ9p9eMMb683Apz/LLsgUlesSz/30n+ZH9P3PbWSY4dY7x53fl3OiPJE9penOSdSY5LcuIu+z+57Y62O6675gv7OwwAADjoNh2m84x1j9cXcXf5ui+uy1d/83GLdcca/3Tzf1z3k2OM8/d00DHGuUnOTZLNJ2zZ03EAAOCwO1Qz729Pcuby+PFJ3rqX7d+a5DHLve/HJzl9D9v9Q9ubLo8/muTr2h7X9uZZuw0nY4xPJ/lM252z/etvsTk/yVN3HqPtXdsesw+vCwAAVuZgzLxvbvvX657/RpJzkryg7c8m+XiSJ+7lGK9K8pAklyf5QNZuafnMbrY7N8mlbS8aYzy+7TOWba9K8r512z1xOf/nsxbsOz0/a/fWX9S2y9gevaFXCQAAK9Yxjow7Q9recoxxTdvjkrwryQOW+99XZvMJW8aJTztjlUPgCHHJtu2rHgIAcAPW9sIxxql72+5w3fO+Eb/f9tgkN0vyq6sOdwAAONIcMfE+xjh91WMAAIAj2RH/L6wCAABrxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJDategBHspOOv3N2bNu+6mEAAEASM+8AADAN8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJPYtOoBHMmu+OiVOflZZ656GKzYJdu2r3oIAABJzLwDAMA0xDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAEzikMd722v2YdvT237ruudPafuE5fHZbW+3H+e/uu1t9nU/AAA40mxa9QB2cXqSa5K8PUnGGL+9bt3ZSS5P8pHDPioAADgCrCTe2353kl9KcrMkf5vk8UmOTvKUJF9q+wNJfjLJQ7IW81cnOTXJS9tem+T+Sd6b5NQxxifanprkWWOM09sel+RlSb42ybuSdN15fyDJOct535nkx8YYXzr0rxgAAA7cqu55f2uS+40x7p1ke5KfG2NcneS3kzx7jHHKGONPd248xnhlkh1JHr+su/Z6jv1vkrx1OfZ5Se6YJG3/eZLHJXnAGOOUJF/K2jcNAAAwhVXdNnOHJC9ve9uszYJfdRCP/eAk35skY4w/aPupZflDktw3ybvbJmsz/R/bdee2T07y5CS56bGbD+KwAADgwKwq3v9Tkt8YY5zX9vQkT9+PY1yXr/zk4Ba7rBu72b5JfneM8a+u76BjjHOTnJskm0/YsrvjAADASqzqtplbJ/nw8viH1i3/bJJb7WGfXdddnbWZ9CR5zLrlb8lyO0zbhyX5mmX5G5M8tu3XLeu2tL3Tfo4fAAAOu8MR75vb/vW6Pz+dtZn2V7T90ySfWLfta5N8T9uL2z5ol+O8MMlvL+uOTvIrSX5zOcb6D53+SpIHt70oyRlJ/ipJxhhXZO1Dshe0vTTJHyW57cF+sQAAcKh0DHeG7MnmE7aME592xqqHwYpdsm37qocAANzAtb1wjHHq3rbzL6wCAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkNq16AEeyk46/c3Zs277qYQAAQBIz7wAAMA3xDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAk9i06gEcya746JU5+VlnrnoYh8wl27aveggAAOwDM+8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJM4oHhve80uz89u+9wDG9I+nf9r2/5D2x89XOcEAIBVmX3m/fuSvCPJWaseCAAAHGqHLN7b3qntG9teuny947L8hW0fu267a5avt237lrYXt7287YOW5We0/bO2F7V9RdtbrjvNWUl+Jskd2t5+3TGf1PYDbd/U9nk7fxqwzNS/qu27lz8POFSvHwAADrYDjfejl9i+uO3FSZ6xbt1zk7xojHGvJC9N8py9HOv7k5w/xjglyclJLm57myS/lOShY4z7JNmR5KeTpO0JSb5+jPGuJP8jyeOW5bdL8stJ7pfkO5N807pz/GaSZ48xTkvymCTP3/+XDgAAh9emA9z/2iW2k6zd857k1OXp/ZN87/L4xUn+w16O9e4kL2h70ySvGWNc3PbbkpyU5G1tk+RmSf5s2f7MrEV7kmxP8t+S/EaSb07y5jHGJ5cxvSLJXZftHprkpOVYSfLP2t5qjPHZda/hyUmenCQ3PXbzBt4CAAA4PA403vfFWL5el2XGv2sVfbMkGWO8pe2DkzwiyYvbPjPJp5L80Rhjd/e0n5Xk+LaPX57fru2JSbqbbXe6SZL7jzGu3eMgxzg3yblJsvmELWNP2wEAwOF2KD+w+vaszY4nyeOTvHV5fHWS+y6PH5XkpsnaPfJJPjbGeF7WZtHvk7UPoz6g7Tcu22xue9e2d0tyzBjj9mOMrWOMrUn+/XK+dyX5trZf03ZT1m6P2emCJD+x80nbUwIAAJM4lPF+TpIntr00yQ8medqy/HlZi+t3JfmWJJ9blp+etfvc35O14P7NMcbHk5yd5GXLcd6RtXvYz0ry6l3O96okZ40xPpzk/0vyziRvSHJFks+sG9Opy4dor0jylIP6igEA4BDqGDe8O0Pa3nKMcc0y8/7qJC8YY+wa+3u1+YQt48SnnXHwB3iEuGTb9lUPAQCAJG0vHGOcurftZv8973vy9OW331ye5Kokr1nxeAAA4IAdzg+sHjZjjG2rHgMAABxsN9SZdwAAuMER7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACT2LTqARzJTjr+ztmxbfuqhwEAAEnMvAMAwDTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATGLTqgdwJLvio1fm5Gedueph/BOXbNu+6iEAALACZt4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACax3/He9nvajrbftJ/7b217bduL217R9kVtb7q/4wEAgBu6A5l5PyvJW5OceQDH+OAY45Qk90xyhyT/1wEcCwAAbtD2K97b3jLJA5I8KUu8t31524ev2+aFbR/T9qi2z2z77raXtv3RXY83xvhSkncluf2y7y3a/k7by9q+p+2372X52W1f0/a1ba9q+xNtf3rZ5h1ttyzbnbPM8l/advv+vHYAAFiVTfu536OT/OEY4wNtP9n2Pkm2J3lckte1vVmShyR5atYC/zNjjNPa3jzJ29pekGTsPFjbWyT5liRPWxb9eJKMMe653JZzQdu7Xs/yJLlHknsnuUWSv0jy82OMe7d9dpInJPmPSX4hyTeMMb7Q9tj9fO0AALAS+3vbzFlZi/UsX89K8vok37EE+sOSvGWMcW2SM5I8oe3FSd6Z5LgkJy773mVZ/rdJ/mqMcemy/IFJXpwkY4z3JfnLJHe9nuVJ8idjjM+OMT6e5DNJXrssvyzJ1uXxpUle2vYHkly3uxfW9sltd7Tdcd01X9if9wYAAA6JfZ55b3tcku9Ico+2I8lRWZtF/7kkb0ryXVmbgX/Zzl2S/OQY4/xdjrM1yz3vbW+b5E1tHznGOG/ZZ7env56hrS/tL697/uV85XU+IsmDkzwyyS+3vfsY46sifoxxbpJzk2TzCVtGAADgCLE/M++PTfKiMcadxhhbxxgnJLkqa7Pi25M8McmDkuyM9fOTPHXnb5Jpe9e2x6w/4Bjjb7J2S8u/Wha9Jcnjd26f5I5J3n89y/eq7U2SnDDG+JOsfaNxbJJb7vOrBwCAFdmfeD8ryat3WfaqJN+f5IKszWy/YYzxxWXd85NckeSitpcn+a/Z/Yz/a5JsbvugJP85yVFtL0vy8iRnjzG+cD3LN+KoJC9Z9n1PkmePMT69wX0BAGDlOoY7Q/Zk8wlbxolPO2PVw/gnLtnmF+UAANyQtL1wjHHq3rbzL6wCAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkNq16AEeyk46/c3Zs277qYQAAQBIz7wAAMA3xDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAk9i06gEcya746JU5+VlnrnoYuWTb9lUPAQCAI4CZdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHdnnvIAAAARJElEQVQAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASexXvLf9UtuL217e9hVtNx/MQbU9u+1z97LN6W2/dd3zp7R9wsEcBwAAHEn2d+b92jHGKWOMeyT5YpKnHMQxbdTpSf4x3scYvz3GeNEKxgEAAIfFwbht5k+TfGOStP3pZTb+8rY/tSzb2vZ9bX+37aVtX7lzpr7t1W1vszw+te2bdj142+9u+86272n7hrbHt92atW8Y/p/lJwAPavv0ttuWfU5p+47lfK9u+zXL8je1/fW272r7gbYPOgivHwAADosDive2m5I8LMllbe+b5IlJviXJ/ZL8SNt7L5veLcm5Y4x7Jfm7JD+2D6d5a5L7jTHunWR7kp8bY1yd5LeTPHv5CcCf7rLPi5L8/HK+y5L8m3XrNo0xvjnJT+2yHAAAjmj7G+9Ht704yY4kf5XkvyV5YJJXjzE+N8a4JsnvJdk5s/2hMcbblscvWbbdqDskOb/tZUl+Nsndr2/jtrdOcuwY483Lot9N8uB1m/ze8vXCJFt3s/+T2+5ou+O6a76wD8MEAIBDa9N+7nftGOOU9Qva9nq2H3t4fl2+8g3ELfaw739K8htjjPPanp7k6fs21H9iZ5F/Kbt5/WOMc5OcmySbT9iy67gBAGBlDuavinxLkke33dz2mCTfk7X74ZPkjm3vvzw+K2u3wiTJ1Unuuzx+zB6Oe+skH14e/9C65Z9NcqtdNx5jfCbJp9bdz/6DSd6863YAADCbgxbvY4yLkrwwybuSvDPJ88cY71lWvzfJD7W9NMmWJP9lWf4rSX6z7Z9mbSZ8d56e5BXLNp9Yt/y1Sb5n5wdWd9nnh5I8cznfKUmecSCvDQAAjgQd49DeGbL8ZpjfX36t5FQ2n7BlnPi0M1Y9jFyybfuqhwAAwCHU9sIxxql7286/sAoAAJPY3w+sbtjyax2nm3UHAIAjjZl3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmIR4BwCASYh3AACYhHgHAIBJiHcAAJiEeAcAgEmIdwAAmMSmVQ/gSHbS8XfOjm3bVz0MAABIYuYdAACmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGASm1Y9gCPZFR+9Mic/68yVnf+SbdtXdm4AAI48Zt4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHQAAJiHeAQBgEuIdAAAmId4BAGAS4h0AACaxoXhv+z1tR9tv2p+TtN3a9tq2F7e9pO3b295tH4/xwraP3Z/zAwDADcFGZ97PSvLWJGcewLk+OMY4ZYxxcpLfTfKLB3Csf9T2qINxHAAAONLtNd7b3jLJA5I8KUu8t31524ev2+aFbR/T9qi2z2z77raXtv3RPRz2nyX51LLvbvfpmue2vaLtHyT5unXnu7rt/9v2rUm+r+2b2j677VvavrftaW1/r+3/avtvl32OafsHy8z/5W0ft1/vGAAArMimDWzz6CR/OMb4QNtPtr1Pku1JHpfkdW1vluQhSZ6atcD/zBjjtLY3T/K2thckGUnu0vbiJLdKsjnJtyzH39M+905ytyT3THJ8kiuSvGDduP5+jPHAJGn7lCRfHGM8uO3TkvzPJPdN8skkH2z77CSnJ/nIGOMRyz633ud3CwAAVmgjt82clbVYz/L1rCSvT/IdS2w/LMlbxhjXJjkjyROWSH9nkuOSnLjsu/O2mbsk+akk5y7L97TPg5O8bIzxpTHGR5L88S7jevkuz89bvl6W5M/HGH8zxvhCkiuTnLAsf2jbX2/7oDHGZ3b3Yts+ue2Otjuuu+YLG3h7AADg8Ljemfe2xyX5jiT3aDuSHJW1WfSfS/KmJN+VtRn4l+3cJclPjjHO3+U4W3c59HlJfmcv+zx8OdeefG6X5ztL+8vrHu98vmn5ycF9kzw8yb9ve8EY4xm7HnSMcW6Wbyw2n7Dl+s4PAACH1d5m3h+b5EVjjDuNMbaOMU5IclWSB2ZtFv6JSR6UZGd4n5/kqW1vmiRt79r2mN0c94FJPriXfd6S5MzlnvjbJvn2/X6Va8e9XZLPjzFekuRZSe5zIMcDAIDDbW/3vJ+V5Nd2WfaqJN+f5JwkL0py3hjji8u65yfZmuSitk3y8azdM5985Z73Jvlikh/eyz6vztqs/2VJPpDkzfv+8r7KPZM8s+2Xk/xD1u7RBwCAaXQMd4bsyeYTtowTn3bGys5/ybbte98IAIDptb1wjHHq3rbzL6wCAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkxDsAAExCvAMAwCTEOwAATEK8AwDAJMQ7AABMQrwDAMAkNq16AEeyk46/c3Zs277qYQAAQBIz7wAAMA3xDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADAJ8Q4AAJMQ7wAAMAnxDgAAkxDvAAAwCfEOAACTEO8AADCJjjFWPYYjVtvPJnn/qsfBYXGbJJ9Y9SA4LFzrGwfX+cbDtb5xuDFc5zuNMb52bxttOhwjmdj7xxinrnoQHHptd7jWNw6u9Y2D63zj4VrfOLjOX+G2GQAAmIR4BwCASYj363fuqgfAYeNa33i41jcOrvONh2t94+A6L3xgFQAAJmHmHQAAJiHeAQBgEuI9Sdt/0fb9bf+i7S/sZv3N2758Wf/OtlsP/yg5GDZwrR/c9qK217V97CrGyIHbwHX+6bZXtL207Rvb3mkV4+TAbeBaP6XtZW0vbvvWtietYpwcuL1d63XbPbbtaOvXCk5oA3+nz2778eXv9MVtf3gV41ylG328tz0qyW8leViSk5KctZv/uD8pyafGGN+Y5NlJfv3wjpKDYYPX+q+SnJ3kvx/e0XGwbPA6vyfJqWOMeyV5ZZL/cHhHycGwwWv938cY9xxjnJK16/wbh3mYHAQbvNZpe6sk5yR55+EdIQfDRq9zkpePMU5Z/jz/sA7yCHCjj/ck35zkL8YYV44xvphke5JH7bLNo5L87vL4lUke0raHcYwcHHu91mOMq8cYlyb58ioGyEGxkev8J2OMzy9P35HkDod5jBwcG7nWf7fu6TFJ/JaGOW3k/9VJ8qtZ+ybt7w/n4DhoNnqdb9TEe3L7JB9a9/yvl2W73WaMcV2SzyQ57rCMjoNpI9ea+e3rdX5Sktcf0hFxqGzoWrf98bYfzFrUnXOYxsbBtddr3fbeSU4YY/z+4RwYB9VG//v9mOW2x1e2PeHwDO3IId6T3c2g7zozs5FtOPK5jjcOG77ObX8gyalJnnlIR8ShsqFrPcb4rTHGXZL8fJJfOuSj4lC43mvd9iZZu631Zw7biDgUNvJ3+rVJti63Pb4hX7kz4kZDvK99V7f+u7Y7JPnInrZpuynJrZN88rCMjoNpI9ea+W3oOrd9aJJ/neSRY4wvHKaxcXDt69/p7UkefUhHxKGyt2t9qyT3SPKmtlcnuV+S83xodTp7/Ts9xvjbdf/Nfl6S+x6msR0xxHvy7iQntv2GtjdLcmaS83bZ5rwkP7Q8fmySPx7+dasZbeRaM7+9Xuflx+v/NWvh/rEVjJGDYyPX+sR1Tx+R5H8dxvFx8FzvtR5jfGaMcZsxxtYxxtasfZblkWOMHasZLvtpI3+nb7vu6SOTvPcwju+IsGnVA1i1McZ1bX8iyflJjkrygjHGn7d9RpIdY4zzkvy3JC9u+xdZm3E/c3UjZn9t5Fq3PS3Jq5N8TZLvbvsrY4y7r3DY7KMN/p1+ZpJbJnnF8tnzvxpjPHJlg2a/bPBa/8TyU5Z/SPKpfGUihols8FozuQ1e53PaPjLJdVlrsrNXNuAVqQlkAACYg9tmAABgEuIdAAAmId4BAGAS4h0AACYh3gEAYBLiHeAI0vZLbS9ue3nb17Y9dgP7XLOX9ce2/bF1z2/X9pUHYaxb215+oMfZx3Oe0vbhh/OcAEcS8Q5wZLl2jHHKGOMeWfsdxj9+EI55bJJ/jPcxxkfGGI89CMc9rJZ/4fqUJOIduNES7wBHrj9LcvudT9r+bNt3t7207a/sunHbW7Z9Y9uL2l7W9lHLql9LcpdlRv+Z62fM276z7d3XHeNNbe/b9pi2L1jO9551x9qttme3fc3y04Kr2v5E259e9n1H2y3rjv8f2759+enCNy/Ltyz7X7psf69l+dPbntv2giQvSvKMJI9bXsvj2n7zcqz3LF/vtm48v9f+n/buJjSuKgzj+P+hFCrWBvELFLQuupEiCS0uJMRPiiuxFGxB0FJBN+oqqOCilUAKdaHoRhEEF2rV4iK6sGo/CNSKtMQ0oBaFuulGrZIGSYuYx8V9A2OaTG2dphl8fjDM3HPvue95ZzPvPXMuV59K+kHSrpaxPlDf0bikfdV2QflGRFwu//snrEZELEWSlgH30TzhGUkbgDXAHYCAEUkDtkdbup0BNto+Lela4CtJI8DzwFrbvXWu1S19dgMPA9vrseM32j4qaRjYb3tbLd35WtIXtv9oM+y1QB+wAvgReM52n6SXgUeBV+q4K23fKWkAeKv6vQiM2X5I0r00hXpvHb8O6Lc9LWkrsN72U5XLKmCgnsx4PzAMbKp+vTWes8BxSa/Vd/Rm9Tkxe1EBvHAR+UZELLoU7xERS8sVkr4BVgNHgc+rfUO9xmp7JU0x31q8CxiuoniGZtb+hvPE+6BibKcp4j9sifegpMHaXgHcDHzX5lwHbE8BU5ImgY+rfQK4veW49wBsj0paVcVyP1V0294v6RpJPXX8iO3pBWL2AG9LWgMYWN6yb5/tSQBJ3wK3AFcDo7ZPVKzf/kO+ERGLLsV7RMTSMm27twrXT2jWvL9KU5jvtP1Gm76PANcB62z/KeknmiJ0QbZPSjpVy1Q2A0/WLgGbbB+/gLGfbfk807I9wz9/bzx3GBXvnOHVe7vZ7yGai4aN9Y/CwQXG81eNQfPEh4vLNyJi0WXNe0TEElQzxs8Ag5KWA3uBbZJWAki6SdL1c7r1AD9X4X4PzUwzwBRwVZtwu4FngR7bE9W2F3hakipeXyfyKpvrnP3AZOU6SnPxgaS7gV9tn56n79xceoCT9Xnrv4h9GLhL0q0Va3bZzKXMNyKiY1K8R0QsUbbHgHFgi+3PgHeBw5ImgD2cW5C/A6yXdISmEP6+znMKOFQ3iL40T6g9wBaaJTSzhmiWoByrm1uHOpcZv0v6EngdeLzadtTYj9HcYPvYAn0PALfN3rAK7AJ2SjoELDtfYNu/AE8AH0kaB96vXZcy34iIjpE937+HERERnSfpIDBo+8jlHktERDfKzHtERERERJfIzHtERERERJfIzHtERERERJdI8R4RERER0SVSvEdEREREdIkU7xERERERXSLFe0REREREl/gbM4xK+Kv0Wt4AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 864x864 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# plot the variable importance\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "importances = clf.feature_importances_\n",
    "indices = np.argsort(importances)\n",
    "plt.figure(figsize=(12,12))\n",
    "plt.title('Feature Importances')\n",
    "plt.barh(range(len(indices)), importances[indices], color='seagreen', align='center')\n",
    "plt.yticks(range(len(indices)),np.array(dataset.feature_names)[indices])\n",
    "plt.xlabel('Relative Importance');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-30T03:47:06.896882Z",
     "start_time": "2020-03-30T03:47:06.869039Z"
    }
   },
   "outputs": [],
   "source": [
    "selector = SelectFromModel(clf, prefit=True, threshold='median')\n",
    "# threshold is set to 'median', which is the median of the variable importances\n",
    "# which is around 0.0763\n",
    "# we can also set the max_features parameters to select the top features\n",
    "\n",
    "transformed_train = selector.transform(train_set)\n",
    "transformed_test = selector.transform(test_set)\n",
    "assert np.array_equal(transformed_train, train_set[:,[0,5,6,7]]) \n",
    "# select the 1st, 6th, 7th and 8th features\n",
    "assert np.array_equal(transformed_test, test_set[:,[0,5,6,7]]) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "690.469px",
    "left": "20.7031px",
    "top": "136.82px",
    "width": "312.969px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
